Repository: alteryx/featuretools
Branch: main
Commit: 938a0f6ccb98
Files: 501
Total size: 2.3 MB

Directory structure:
gitextract_b07mgx0i/

├── .codecov.yml
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── blank_issue.md
│   │   ├── bug_report.md
│   │   ├── config.yml
│   │   ├── documentation_improvement.md
│   │   └── feature_request.md
│   ├── auto_assign.yml
│   └── workflows/
│       ├── auto_approve_dependency_PRs.yaml
│       ├── broken_link_check.yaml
│       ├── build_docs.yaml
│       ├── create_feedstock_pr.yaml
│       ├── install_test.yaml
│       ├── kickoff_evalml_unit_tests.yaml
│       ├── latest_dependency_checker.yaml
│       ├── lint_check.yaml
│       ├── minimum_dependency_checker.yaml
│       ├── performance-check.yaml
│       ├── pull_request_check.yaml
│       ├── release.yaml
│       ├── release_notes_updated.yaml
│       ├── test_without_test_dependencies.yaml
│       ├── tests_with_latest_deps.yaml
│       ├── tests_with_minimum_deps.yaml
│       └── tests_with_woodwork_main_branch.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── .readthedocs.yaml
├── LICENSE
├── Makefile
├── README.md
├── contributing.md
├── docs/
│   ├── Makefile
│   ├── backport_release.md
│   ├── make.bat
│   ├── notebook_version_standardizer.py
│   ├── pull_request_template.md
│   └── source/
│       ├── _static/
│       │   └── style.css
│       ├── api_reference.rst
│       ├── conf.py
│       ├── getting_started/
│       │   ├── afe.ipynb
│       │   ├── getting_started_index.rst
│       │   ├── handling_time.ipynb
│       │   ├── primitives.ipynb
│       │   ├── using_entitysets.ipynb
│       │   └── woodwork_types.ipynb
│       ├── guides/
│       │   ├── advanced_custom_primitives.ipynb
│       │   ├── deployment.ipynb
│       │   ├── feature_descriptions.ipynb
│       │   ├── feature_selection.ipynb
│       │   ├── guides_index.rst
│       │   ├── performance.ipynb
│       │   ├── specifying_primitive_options.ipynb
│       │   ├── sql_database_integration.ipynb
│       │   ├── time_series.ipynb
│       │   └── tuning_dfs.ipynb
│       ├── index.ipynb
│       ├── install.md
│       ├── release_notes.rst
│       ├── resources/
│       │   ├── ecosystem.rst
│       │   ├── frequently_asked_questions.ipynb
│       │   ├── help.rst
│       │   ├── resources_index.rst
│       │   ├── transition_to_ft_v1.0.ipynb
│       │   └── usage_tips/
│       │       ├── glossary.rst
│       │       └── limitations.rst
│       ├── set-headers.py
│       ├── setup.py
│       └── templates/
│           └── layout.html
├── featuretools/
│   ├── __init__.py
│   ├── __main__.py
│   ├── computational_backends/
│   │   ├── __init__.py
│   │   ├── api.py
│   │   ├── calculate_feature_matrix.py
│   │   ├── feature_set.py
│   │   ├── feature_set_calculator.py
│   │   └── utils.py
│   ├── config_init.py
│   ├── demo/
│   │   ├── __init__.py
│   │   ├── api.py
│   │   ├── flight.py
│   │   ├── mock_customer.py
│   │   ├── retail.py
│   │   └── weather.py
│   ├── entityset/
│   │   ├── __init__.py
│   │   ├── api.py
│   │   ├── deserialize.py
│   │   ├── entityset.py
│   │   ├── relationship.py
│   │   ├── serialize.py
│   │   └── timedelta.py
│   ├── exceptions.py
│   ├── feature_base/
│   │   ├── __init__.py
│   │   ├── api.py
│   │   ├── cache.py
│   │   ├── feature_base.py
│   │   ├── feature_descriptions.py
│   │   ├── feature_visualizer.py
│   │   ├── features_deserializer.py
│   │   ├── features_serializer.py
│   │   └── utils.py
│   ├── feature_discovery/
│   │   ├── FeatureCollection.py
│   │   ├── LiteFeature.py
│   │   ├── __init__.py
│   │   ├── convertors.py
│   │   ├── feature_discovery.py
│   │   ├── type_defs.py
│   │   └── utils.py
│   ├── primitives/
│   │   ├── __init__.py
│   │   ├── base/
│   │   │   ├── __init__.py
│   │   │   ├── aggregation_primitive_base.py
│   │   │   ├── primitive_base.py
│   │   │   └── transform_primitive_base.py
│   │   ├── options_utils.py
│   │   ├── standard/
│   │   │   ├── __init__.py
│   │   │   ├── aggregation/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── all_primitive.py
│   │   │   │   ├── any_primitive.py
│   │   │   │   ├── average_count_per_unique.py
│   │   │   │   ├── avg_time_between.py
│   │   │   │   ├── count.py
│   │   │   │   ├── count_above_mean.py
│   │   │   │   ├── count_below_mean.py
│   │   │   │   ├── count_greater_than.py
│   │   │   │   ├── count_inside_nth_std.py
│   │   │   │   ├── count_inside_range.py
│   │   │   │   ├── count_less_than.py
│   │   │   │   ├── count_outside_nth_std.py
│   │   │   │   ├── count_outside_range.py
│   │   │   │   ├── date_first_event.py
│   │   │   │   ├── entropy.py
│   │   │   │   ├── first.py
│   │   │   │   ├── first_last_time_delta.py
│   │   │   │   ├── has_no_duplicates.py
│   │   │   │   ├── is_monotonically_decreasing.py
│   │   │   │   ├── is_monotonically_increasing.py
│   │   │   │   ├── is_unique.py
│   │   │   │   ├── kurtosis.py
│   │   │   │   ├── last.py
│   │   │   │   ├── max_consecutive_false.py
│   │   │   │   ├── max_consecutive_negatives.py
│   │   │   │   ├── max_consecutive_positives.py
│   │   │   │   ├── max_consecutive_true.py
│   │   │   │   ├── max_consecutive_zeros.py
│   │   │   │   ├── max_count.py
│   │   │   │   ├── max_min_delta.py
│   │   │   │   ├── max_primitive.py
│   │   │   │   ├── mean.py
│   │   │   │   ├── median.py
│   │   │   │   ├── median_count.py
│   │   │   │   ├── min_count.py
│   │   │   │   ├── min_primitive.py
│   │   │   │   ├── mode.py
│   │   │   │   ├── n_most_common.py
│   │   │   │   ├── n_most_common_frequency.py
│   │   │   │   ├── n_unique_days.py
│   │   │   │   ├── n_unique_days_of_calendar_year.py
│   │   │   │   ├── n_unique_days_of_month.py
│   │   │   │   ├── n_unique_months.py
│   │   │   │   ├── n_unique_weeks.py
│   │   │   │   ├── num_consecutive_greater_mean.py
│   │   │   │   ├── num_consecutive_less_mean.py
│   │   │   │   ├── num_false_since_last_true.py
│   │   │   │   ├── num_peaks.py
│   │   │   │   ├── num_true.py
│   │   │   │   ├── num_true_since_last_false.py
│   │   │   │   ├── num_unique.py
│   │   │   │   ├── num_zero_crossings.py
│   │   │   │   ├── percent_true.py
│   │   │   │   ├── percent_unique.py
│   │   │   │   ├── skew.py
│   │   │   │   ├── std.py
│   │   │   │   ├── sum_primitive.py
│   │   │   │   ├── time_since_first.py
│   │   │   │   ├── time_since_last.py
│   │   │   │   ├── time_since_last_false.py
│   │   │   │   ├── time_since_last_max.py
│   │   │   │   ├── time_since_last_min.py
│   │   │   │   ├── time_since_last_true.py
│   │   │   │   ├── trend.py
│   │   │   │   └── variance.py
│   │   │   └── transform/
│   │   │       ├── __init__.py
│   │   │       ├── absolute_diff.py
│   │   │       ├── binary/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── add_numeric.py
│   │   │       │   ├── add_numeric_scalar.py
│   │   │       │   ├── and_primitive.py
│   │   │       │   ├── divide_by_feature.py
│   │   │       │   ├── divide_numeric.py
│   │   │       │   ├── divide_numeric_scalar.py
│   │   │       │   ├── equal.py
│   │   │       │   ├── equal_scalar.py
│   │   │       │   ├── greater_than.py
│   │   │       │   ├── greater_than_equal_to.py
│   │   │       │   ├── greater_than_equal_to_scalar.py
│   │   │       │   ├── greater_than_scalar.py
│   │   │       │   ├── less_than.py
│   │   │       │   ├── less_than_equal_to.py
│   │   │       │   ├── less_than_equal_to_scalar.py
│   │   │       │   ├── less_than_scalar.py
│   │   │       │   ├── modulo_by_feature.py
│   │   │       │   ├── modulo_numeric.py
│   │   │       │   ├── modulo_numeric_scalar.py
│   │   │       │   ├── multiply_boolean.py
│   │   │       │   ├── multiply_numeric.py
│   │   │       │   ├── multiply_numeric_boolean.py
│   │   │       │   ├── multiply_numeric_scalar.py
│   │   │       │   ├── not_equal.py
│   │   │       │   ├── not_equal_scalar.py
│   │   │       │   ├── or_primitive.py
│   │   │       │   ├── scalar_subtract_numeric_feature.py
│   │   │       │   ├── subtract_numeric.py
│   │   │       │   └── subtract_numeric_scalar.py
│   │   │       ├── cumulative/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── cum_count.py
│   │   │       │   ├── cum_max.py
│   │   │       │   ├── cum_mean.py
│   │   │       │   ├── cum_min.py
│   │   │       │   ├── cum_sum.py
│   │   │       │   ├── cumulative_time_since_last_false.py
│   │   │       │   └── cumulative_time_since_last_true.py
│   │   │       ├── datetime/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── age.py
│   │   │       │   ├── date_to_holiday.py
│   │   │       │   ├── date_to_timezone.py
│   │   │       │   ├── day.py
│   │   │       │   ├── day_of_year.py
│   │   │       │   ├── days_in_month.py
│   │   │       │   ├── diff_datetime.py
│   │   │       │   ├── distance_to_holiday.py
│   │   │       │   ├── hour.py
│   │   │       │   ├── is_federal_holiday.py
│   │   │       │   ├── is_first_week_of_month.py
│   │   │       │   ├── is_leap_year.py
│   │   │       │   ├── is_lunch_time.py
│   │   │       │   ├── is_month_end.py
│   │   │       │   ├── is_month_start.py
│   │   │       │   ├── is_quarter_end.py
│   │   │       │   ├── is_quarter_start.py
│   │   │       │   ├── is_weekend.py
│   │   │       │   ├── is_working_hours.py
│   │   │       │   ├── is_year_end.py
│   │   │       │   ├── is_year_start.py
│   │   │       │   ├── minute.py
│   │   │       │   ├── month.py
│   │   │       │   ├── part_of_day.py
│   │   │       │   ├── quarter.py
│   │   │       │   ├── season.py
│   │   │       │   ├── second.py
│   │   │       │   ├── time_since.py
│   │   │       │   ├── time_since_previous.py
│   │   │       │   ├── utils.py
│   │   │       │   ├── week.py
│   │   │       │   ├── weekday.py
│   │   │       │   └── year.py
│   │   │       ├── email/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── email_address_to_domain.py
│   │   │       │   └── is_free_email_domain.py
│   │   │       ├── exponential/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── exponential_weighted_average.py
│   │   │       │   ├── exponential_weighted_std.py
│   │   │       │   └── exponential_weighted_variance.py
│   │   │       ├── file_extension.py
│   │   │       ├── full_name_to_first_name.py
│   │   │       ├── full_name_to_last_name.py
│   │   │       ├── full_name_to_title.py
│   │   │       ├── is_in.py
│   │   │       ├── is_null.py
│   │   │       ├── latlong/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── cityblock_distance.py
│   │   │       │   ├── geomidpoint.py
│   │   │       │   ├── haversine.py
│   │   │       │   ├── is_in_geobox.py
│   │   │       │   ├── latitude.py
│   │   │       │   ├── longitude.py
│   │   │       │   └── utils.py
│   │   │       ├── natural_language/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── constants.py
│   │   │       │   ├── count_string.py
│   │   │       │   ├── mean_characters_per_word.py
│   │   │       │   ├── median_word_length.py
│   │   │       │   ├── num_characters.py
│   │   │       │   ├── num_unique_separators.py
│   │   │       │   ├── num_words.py
│   │   │       │   ├── number_of_common_words.py
│   │   │       │   ├── number_of_hashtags.py
│   │   │       │   ├── number_of_mentions.py
│   │   │       │   ├── number_of_unique_words.py
│   │   │       │   ├── number_of_words_in_quotes.py
│   │   │       │   ├── punctuation_count.py
│   │   │       │   ├── title_word_count.py
│   │   │       │   ├── total_word_length.py
│   │   │       │   ├── upper_case_count.py
│   │   │       │   ├── upper_case_word_count.py
│   │   │       │   └── whitespace_count.py
│   │   │       ├── not_primitive.py
│   │   │       ├── nth_week_of_month.py
│   │   │       ├── numeric/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── absolute.py
│   │   │       │   ├── cosine.py
│   │   │       │   ├── diff.py
│   │   │       │   ├── natural_logarithm.py
│   │   │       │   ├── negate.py
│   │   │       │   ├── percentile.py
│   │   │       │   ├── rate_of_change.py
│   │   │       │   ├── same_as_previous.py
│   │   │       │   ├── sine.py
│   │   │       │   ├── square_root.py
│   │   │       │   └── tangent.py
│   │   │       ├── percent_change.py
│   │   │       ├── postal/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── one_digit_postal_code.py
│   │   │       │   └── two_digit_postal_code.py
│   │   │       ├── savgol_filter.py
│   │   │       ├── time_series/
│   │   │       │   ├── __init__.py
│   │   │       │   ├── expanding/
│   │   │       │   │   ├── __init__.py
│   │   │       │   │   ├── expanding_count.py
│   │   │       │   │   ├── expanding_max.py
│   │   │       │   │   ├── expanding_mean.py
│   │   │       │   │   ├── expanding_min.py
│   │   │       │   │   ├── expanding_std.py
│   │   │       │   │   └── expanding_trend.py
│   │   │       │   ├── lag.py
│   │   │       │   ├── numeric_lag.py
│   │   │       │   ├── rolling_count.py
│   │   │       │   ├── rolling_max.py
│   │   │       │   ├── rolling_mean.py
│   │   │       │   ├── rolling_min.py
│   │   │       │   ├── rolling_outlier_count.py
│   │   │       │   ├── rolling_std.py
│   │   │       │   ├── rolling_trend.py
│   │   │       │   └── utils.py
│   │   │       └── url/
│   │   │           ├── __init__.py
│   │   │           ├── url_to_domain.py
│   │   │           ├── url_to_protocol.py
│   │   │           └── url_to_tld.py
│   │   └── utils.py
│   ├── selection/
│   │   ├── __init__.py
│   │   ├── api.py
│   │   └── selection.py
│   ├── synthesis/
│   │   ├── __init__.py
│   │   ├── api.py
│   │   ├── deep_feature_synthesis.py
│   │   ├── dfs.py
│   │   ├── encode_features.py
│   │   ├── get_valid_primitives.py
│   │   └── utils.py
│   ├── tests/
│   │   ├── __init__.py
│   │   ├── computational_backend/
│   │   │   ├── __init__.py
│   │   │   ├── test_calculate_feature_matrix.py
│   │   │   ├── test_feature_set.py
│   │   │   ├── test_feature_set_calculator.py
│   │   │   └── test_utils.py
│   │   ├── config_tests/
│   │   │   ├── __init__.py
│   │   │   └── test_config.py
│   │   ├── conftest.py
│   │   ├── demo_tests/
│   │   │   ├── __init__.py
│   │   │   └── test_demo_data.py
│   │   ├── entityset_tests/
│   │   │   ├── __init__.py
│   │   │   ├── test_es.py
│   │   │   ├── test_es_metadata.py
│   │   │   ├── test_last_time_index.py
│   │   │   ├── test_plotting.py
│   │   │   ├── test_relationship.py
│   │   │   ├── test_serialization.py
│   │   │   ├── test_timedelta.py
│   │   │   └── test_ww_es.py
│   │   ├── entry_point_tests/
│   │   │   ├── __init__.py
│   │   │   ├── add-ons/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── featuretools_plugin/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── featuretools_plugin/
│   │   │   │   │   │   └── __init__.py
│   │   │   │   │   └── setup.py
│   │   │   │   └── featuretools_primitives/
│   │   │   │       ├── __init__.py
│   │   │   │       ├── featuretools_primitives/
│   │   │   │       │   ├── __init__.py
│   │   │   │       │   ├── existing_primitive.py
│   │   │   │       │   ├── invalid_primitive.py
│   │   │   │       │   └── new_primitive.py
│   │   │   │       └── setup.py
│   │   │   ├── test_plugin.py
│   │   │   ├── test_primitives.py
│   │   │   └── utils.py
│   │   ├── feature_discovery/
│   │   │   ├── __init__.py
│   │   │   ├── test_convertors.py
│   │   │   ├── test_feature_collection.py
│   │   │   ├── test_feature_discovery.py
│   │   │   └── test_type_defs.py
│   │   ├── primitive_tests/
│   │   │   ├── __init__.py
│   │   │   ├── aggregation_primitive_tests/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── test_agg_primitives.py
│   │   │   │   ├── test_count_aggregation_primitives.py
│   │   │   │   ├── test_max_consecutive.py
│   │   │   │   ├── test_num_consecutive.py
│   │   │   │   ├── test_percent_true.py
│   │   │   │   ├── test_rolling_primitive.py
│   │   │   │   └── test_time_since.py
│   │   │   ├── bad_primitive_files/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── multiple_primitives.py
│   │   │   │   └── no_primitives.py
│   │   │   ├── natural_language_primitives_tests/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── test_count_string.py
│   │   │   │   ├── test_mean_characters_per_word.py
│   │   │   │   ├── test_median_word_length.py
│   │   │   │   ├── test_natural_language_primitives_terminate.py
│   │   │   │   ├── test_num_characters.py
│   │   │   │   ├── test_num_unique_separators.py
│   │   │   │   ├── test_num_words.py
│   │   │   │   ├── test_number_of_common_words.py
│   │   │   │   ├── test_number_of_hashtags.py
│   │   │   │   ├── test_number_of_mentions.py
│   │   │   │   ├── test_number_of_unique_words.py
│   │   │   │   ├── test_number_of_words_in_quotes.py
│   │   │   │   ├── test_punctuation_count.py
│   │   │   │   ├── test_title_word_count.py
│   │   │   │   ├── test_total_word_length.py
│   │   │   │   ├── test_upper_case_count.py
│   │   │   │   ├── test_upper_case_word_count.py
│   │   │   │   └── test_whitespace_count.py
│   │   │   ├── primitives_to_install/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── custom_max.py
│   │   │   │   ├── custom_mean.py
│   │   │   │   └── custom_sum.py
│   │   │   ├── test_absolute_diff.py
│   │   │   ├── test_agg_feats.py
│   │   │   ├── test_all_primitive_docstrings.py
│   │   │   ├── test_direct_features.py
│   │   │   ├── test_feature_base.py
│   │   │   ├── test_feature_descriptions.py
│   │   │   ├── test_feature_serialization.py
│   │   │   ├── test_feature_utils.py
│   │   │   ├── test_feature_visualizer.py
│   │   │   ├── test_features_deserializer.py
│   │   │   ├── test_features_serializer.py
│   │   │   ├── test_groupby_transform_primitives.py
│   │   │   ├── test_identity_features.py
│   │   │   ├── test_overrides.py
│   │   │   ├── test_primitive_base.py
│   │   │   ├── test_primitive_utils.py
│   │   │   ├── test_rolling_primitive_utils.py
│   │   │   ├── test_transform_features.py
│   │   │   ├── transform_primitive_tests/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── test_cumulative_time_since.py
│   │   │   │   ├── test_datetoholiday_primitive.py
│   │   │   │   ├── test_distancetoholiday_primitive.py
│   │   │   │   ├── test_expanding_primitives.py
│   │   │   │   ├── test_exponential_primitives.py
│   │   │   │   ├── test_full_name_primitives.py
│   │   │   │   ├── test_is_federal_holiday.py
│   │   │   │   ├── test_latlong_primitives.py
│   │   │   │   ├── test_percent_change.py
│   │   │   │   ├── test_percent_unique.py
│   │   │   │   ├── test_postal_primitives.py
│   │   │   │   ├── test_same_as_previous.py
│   │   │   │   ├── test_savgol_filter.py
│   │   │   │   ├── test_season.py
│   │   │   │   └── test_transform_primitive.py
│   │   │   └── utils.py
│   │   ├── profiling/
│   │   │   ├── __init__.py
│   │   │   └── dfs_profile.py
│   │   ├── requirement_files/
│   │   │   ├── latest_requirements.txt
│   │   │   ├── minimum_core_requirements.txt
│   │   │   ├── minimum_dask_requirements.txt
│   │   │   └── minimum_test_requirements.txt
│   │   ├── selection/
│   │   │   ├── __init__.py
│   │   │   └── test_selection.py
│   │   ├── synthesis/
│   │   │   ├── __init__.py
│   │   │   ├── test_deep_feature_synthesis.py
│   │   │   ├── test_dfs_method.py
│   │   │   ├── test_encode_features.py
│   │   │   └── test_get_valid_primitives.py
│   │   ├── test_version.py
│   │   ├── testing_utils/
│   │   │   ├── __init__.py
│   │   │   ├── cluster.py
│   │   │   ├── es_utils.py
│   │   │   ├── features.py
│   │   │   ├── generate_fake_dataframe.py
│   │   │   └── mock_ds.py
│   │   └── utils_tests/
│   │       ├── __init__.py
│   │       ├── test_config.py
│   │       ├── test_description_utils.py
│   │       ├── test_entry_point.py
│   │       ├── test_gen_utils.py
│   │       ├── test_recommend_primitives.py
│   │       ├── test_time_utils.py
│   │       ├── test_trie.py
│   │       └── test_utils_info.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── api.py
│   │   ├── common_tld_utils.py
│   │   ├── description_utils.py
│   │   ├── entry_point.py
│   │   ├── gen_utils.py
│   │   ├── plot_utils.py
│   │   ├── recommend_primitives.py
│   │   ├── s3_utils.py
│   │   ├── schema_utils.py
│   │   ├── time_utils.py
│   │   ├── trie.py
│   │   ├── utils_info.py
│   │   └── wrangle.py
│   └── version.py
├── pyproject.toml
└── release.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .codecov.yml
================================================
codecov:
    notify:
        after_n_builds: 5


================================================
FILE: .github/ISSUE_TEMPLATE/blank_issue.md
================================================
---
name: Blank Issue
about: Create a blank issue
title: ''
labels: ''
assignees: ''

---


================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug Report
about: Create a bug report to help us improve Featuretools
title: ''
labels: 'bug'
assignees: ''

---

[A clear and concise description of what the bug is.]

#### Code Sample, a copy-pastable example to reproduce your bug.

```python
# Your code here

```

#### Output of ``featuretools.show_info()``

<details>

[Paste the output of ``featuretools.show_info()`` below this line]

</details>


================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: true
contact_links:
  - name: General Technical Question
    about: "If you have a question like *How should I create my EntitySet?* you can ask on StackOverflow using the #featuretools tag."
    url: https://stackoverflow.com/questions/tagged/featuretools
  - name: Real-time chat
    url: https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA
    about: "If you want to meet others in the community and chat about all things Alteryx OSS then check out our Slack."


================================================
FILE: .github/ISSUE_TEMPLATE/documentation_improvement.md
================================================
---
name: Documentation Improvement
about: Suggest an idea for improving the documentation
title: ''
labels: 'documentation'
assignees: ''

---

[A description of the documentation you believe needs to be fixed or improved.]


================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.md
================================================
---
name: Feature Request
about: Suggest an idea for this project
title: ''
labels: 'new feature'
assignees: ''

---

- As a [user/developer], I wish I could use Featuretools to ...

#### Code Example

```python
# Your code here, if applicable

```


================================================
FILE: .github/auto_assign.yml
================================================
# Set to `author` to assign the PR creator as the assignee
addAssignees: author


================================================
FILE: .github/workflows/auto_approve_dependency_PRs.yaml
================================================
name: Auto Approve Dependency PRs
on:
  schedule:
      - cron: '*/30 * * * *'
  workflow_dispatch:
  workflow_run:
    workflows: ["Unit Tests - Latest Dependencies", "Unit Tests - 3.9 Minimum Dependencies"]
    branches:
      - 'latest-dep-update-[a-f0-9]+'
      - 'min-dep-update-[a-f0-9]+'
    types:
      - completed
jobs:
  build:
    if: ${{ github.repository_owner == 'alteryx' }}
    runs-on: ubuntu-latest
    steps:
      - name: Find dependency PRs
        id: find_prs
        run: |
          gh auth status
          gh pr list --repo "${{ github.repository }}" --assignee "machineFL" --base main --state open --search "status:success review:required" --limit 1 --json number > dep_PRs_waiting_approval.json
          dep_pull_request=$(cat dep_PRs_waiting_approval.json | grep -Eo "[0-9]*")
          echo "dep_pull_request=${dep_pull_request}" >> "$GITHUB_OUTPUT"
        env:
          GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }}
      - name: Approve dependency PRs and enable auto-merge
        if: ${{ steps.find_prs.outputs.dep_pull_request > 1 }}
        run: |
          gh pr review --repo "${{ github.repository }}" --comment --body "auto approve" ${{ steps.find_prs.outputs.dep_pull_request }}
          gh pr review --repo "${{ github.repository }}" --approve ${{ steps.find_prs.outputs.dep_pull_request }}
          gh pr merge --repo "${{ github.repository }}" --auto --squash --delete-branch ${{ steps.find_prs.outputs.dep_pull_request }}
        env:
          GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }}


================================================
FILE: .github/workflows/broken_link_check.yaml
================================================
name: Broken link check
on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * 1"  # weekly, Mondays at 00:00 UTC (the previous "* * * * 1" fired every minute on Mondays)

jobs:
  my-broken-link-checker:
    name: Check for broken links
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
    steps:
      - name: Check for broken links
        id: link-report
        uses: ruzickap/action-my-broken-link-checker@v2
        with:
          url: https://featuretools.alteryx.com/en/latest/
          cmd_params: '--max-connections=10 --color=always --ignore-fragments --buffer-size=8192 --skip-tls-verification --exclude="(twitter|github|cloudflare|featuretools\\.alteryx\\.com\\/en\\/(stable|main|v.+).*)"'
      - name: Add to job output
        run: echo "${{steps.link-report.outputs.result}}" >> $GITHUB_STEP_SUMMARY


================================================
FILE: .github/workflows/build_docs.yaml
================================================
name: Build Docs
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
  workflow_dispatch:
env:
  PYARROW_IGNORE_TIMEZONE: 1
  JAVA_HOME: "/usr/lib/jvm/java-11-openjdk-amd64"
jobs:
  build_docs:
    name: ${{ matrix.python_version }} build docs
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python_version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Set up python ${{ matrix.python_version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python_version }}
          cache: 'pip'
          cache-dependency-path: 'pyproject.toml'
      - uses: actions/cache@v3
        id: cache
        with:
          path: ${{ env.pythonLocation }}
          key: ${{ matrix.python_version }}-docs-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}-v01
      - name: Build featuretools package
        run: |
          make package
      - name: Install complete version of featuretools from sdist (not using cache)
        if: steps.cache.outputs.cache-hit != 'true'
        run: |
          python -m pip install "unpacked_sdist/[dev]"
      - name: Install complete version of featuretools from sdist (using cache)
        if: steps.cache.outputs.cache-hit == 'true'
        run: |
          python -m pip install "unpacked_sdist/[dev]" --no-deps
      - name: Install apt packages
        run: |
          sudo apt update
          sudo apt install -y pandoc
          sudo apt install -y graphviz
          python -m pip check
      - name: Build docs
        run: make -C docs/ -e "SPHINXOPTS=-W -j auto" clean html


================================================
FILE: .github/workflows/create_feedstock_pr.yaml
================================================
on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Released PyPI version to use (e.g. v1.11.1)'
        required: true

name: Create Feedstock PR
jobs:
  create_feedstock_pr:
    name: Create Feedstock PR
    runs-on: ubuntu-latest
    steps:
      - name: Checkout inputted version
        uses: actions/checkout@v3
        with:
          repository: ${{ github.event.pull_request.head.repo.full_name }}
          ref: ${{ github.event.inputs.version }}
          path: "./featuretools"
      - name: Pull latest from upstream for user forked feedstock
        run: |
          gh auth status
          gh repo sync alteryx/featuretools-feedstock --branch main --source conda-forge/featuretools-feedstock --force
        env:
          GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }}
      - uses: actions/checkout@v3
        with:
          repository: alteryx/featuretools-feedstock
          ref: main
          path: "./featuretools-feedstock"
          fetch-depth: '0'
      - name: Run Create Feedstock meta YAML
        id: create-feedstock-meta
        uses: alteryx/create-feedstock-meta-yaml@v4
        with:
          project: "featuretools"
          pypi_version: ${{ github.event.inputs.version }}
          project_metadata_filepath: "featuretools/pyproject.toml"
          meta_yaml_filepath: "featuretools-feedstock/recipe/meta.yaml"
          add_to_test_requirements: "graphviz !=2.47.2"
      - name: View updated meta yaml
        run: cat featuretools-feedstock/recipe/meta.yaml
      - name: Push updated yaml
        run: |
          cd featuretools-feedstock
          git config --unset-all http.https://github.com/.extraheader
          git config --global user.email "machineOSS@alteryx.com"
          git config --global user.name "machineAYX Bot"
          git remote set-url origin https://${{ secrets.AUTO_APPROVE_TOKEN }}@github.com/alteryx/featuretools-feedstock
          git checkout -b ${{ github.event.inputs.version }}
          git add recipe/meta.yaml
          git commit -m "${{ github.event.inputs.version }}"
          git push origin ${{ github.event.inputs.version }}
      - name: Adding URL to job output
        run: |
          echo "Conda Feedstock Pull Request: https://github.com/alteryx/featuretools-feedstock/pull/new/${{ github.event.inputs.version }}" >> $GITHUB_STEP_SUMMARY


================================================
FILE: .github/workflows/install_test.yaml
================================================
name: Install Test
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
env:
  ALTERYX_OPEN_SRC_UPDATE_CHECKER: False
jobs:
  install_ft_complete:
    name: ${{ matrix.os }} - ${{ matrix.python_version }} install featuretools complete
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python_version: ["3.9", "3.10", "3.11", "3.12"]
    runs-on: ${{ matrix.os }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Set up python ${{ matrix.python_version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python_version }}
          cache: 'pip'
          cache-dependency-path: 'pyproject.toml'
      - name: Build featuretools package
        run: |
          make package
      - name: Install complete version of featuretools from sdist
        run: |
          python -m pip install "unpacked_sdist/[complete]"
      - name: Test by importing packages
        run: |
          python -c "import premium_primitives"
          python -c "from nlp_primitives import PolarityScore"
      - name: Check package conflicts
        run: |
          python -m pip check
      - name: Verify extra_requires commands
        run: |
          python -m pip install "unpacked_sdist/[nlp]"


================================================
FILE: .github/workflows/kickoff_evalml_unit_tests.yaml
================================================
name: Kickoff EvalML Unit Tests

on:
  push:
    branches:
      - main
  workflow_dispatch:

jobs:
  kickoff:
    name: Run EvalML unit tests
    if: github.repository_owner == 'alteryx'
    runs-on: ubuntu-latest
    steps:
      - name: Run workflow for EvalML unit tests
        run: gh workflow run unit_tests_with_featuretools_main_branch.yaml --repo "alteryx/evalml"
        env:
          GITHUB_TOKEN: ${{ secrets.REPO_SCOPED_TOKEN }}


================================================
FILE: .github/workflows/latest_dependency_checker.yaml
================================================
# This workflow installs dependencies and, if any critical dependencies have changed, creates a pull request
# that triggers a CI run with the new dependencies.

name: Latest Dependency Checker
on:
  schedule:
    - cron: '0 * * * *'
  workflow_dispatch:
jobs:
  build:
    if: ${{ github.repository_owner == 'alteryx' }}
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Update dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -e ".[dask,test]"
          make checkdeps OUTPUT_PATH=featuretools/tests/requirement_files/latest_requirements.txt
      - name: Create pull request
        uses: peter-evans/create-pull-request@v3
        with:
          token: ${{ secrets.REPO_SCOPED_TOKEN }}
          commit-message: Update latest dependencies
          title: Automated Latest Dependency Updates
          author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
          body: "This is an auto-generated PR with **latest** dependency updates.
                Please do not delete the `latest-dep-update` branch because it's needed by the auto-dependency bot."
          branch: latest-dep-update
          branch-suffix: short-commit-hash
          base: main
          assignees: machineFL
          reviewers: machineAYX


================================================
FILE: .github/workflows/lint_check.yaml
================================================
name: Lint Check
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
jobs:
  lint_check:
    name: ${{ matrix.python_version }} lint check
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python_version: ["3.12"]
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Set up python ${{ matrix.python_version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python_version }}
          cache: 'pip'
          cache-dependency-path: 'pyproject.toml'
      - uses: actions/cache@v3
        id: cache
        with:
          path: ${{ env.pythonLocation }}
          key: ${{ matrix.python_version }}-lint-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}-v01
      - name: Install featuretools with optional, dev, and test requirements (not using cache)
        if: steps.cache.outputs.cache-hit != 'true'
        run: |
          python -m pip install -e .[dev]
      - name: Install featuretools with no requirements (using cache)
        if: steps.cache.outputs.cache-hit == 'true'
        run: |
          python -m pip install -e .[dev] --no-deps
      - name: Run lint test
        run: make lint


================================================
FILE: .github/workflows/minimum_dependency_checker.yaml
================================================
name: Minimum Dependency Checker
on:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      - 'pyproject.toml'
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Run min dep generator - test reqs
        id: min_dep_gen_test
        uses: alteryx/minimum-dependency-generator@v3
        with:
          paths: 'pyproject.toml'
          options: 'dependencies'
          extras_require: 'test'
          output_filepath: featuretools/tests/requirement_files/minimum_test_requirements.txt
      - name: Run min dep generator - core reqs
        id: min_dep_gen_core
        uses: alteryx/minimum-dependency-generator@v3
        with:
          paths: 'pyproject.toml'
          options: 'dependencies'
          output_filepath: featuretools/tests/requirement_files/minimum_core_requirements.txt
      - name: Run min dep generator - dask
        id: min_dep_gen_dask
        uses: alteryx/minimum-dependency-generator@v3
        with:
          paths: 'pyproject.toml'
          options: 'dependencies'
          extras_require: 'dask'
          output_filepath: featuretools/tests/requirement_files/minimum_dask_requirements.txt
      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v3
        with:
          token: ${{ secrets.REPO_SCOPED_TOKEN }}
          commit-message: Update minimum dependencies
          title: Automated Minimum Dependency Updates
          author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
          body: "This is an auto-generated PR with **minimum** dependency updates.
                 Please do not delete the `min-dep-update` branch because it's needed by the auto-dependency bot."
          branch: min-dep-update
          branch-suffix: short-commit-hash
          base: main
          assignees: machineFL
          reviewers: machineAYX


================================================
FILE: .github/workflows/performance-check.yaml
================================================
name: performance-check
on:
  push:
    branches:
      - main
  workflow_dispatch:
jobs:
  run-performance-analysis:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Run Lambda
        env:
          lambda_function: ${{ secrets.LAMBDA_FUNC }}
        run: |
          echo "{\"TestCommit\": \"$GITHUB_SHA\", \"Flags\": \"--upload-slack\"}" | base64 > payload.b64
          aws lambda invoke --function-name $lambda_function --payload file://payload.b64 --invocation-type Event /dev/stdout 1>/dev/null


================================================
FILE: .github/workflows/pull_request_check.yaml
================================================
name: Pull Request Check
on:
  pull_request:
    types: [opened, edited, reopened, synchronize]
jobs:
  pull_request_check:
    name: pull request check
    runs-on: ubuntu-latest
    steps:
      - uses: nearform-actions/github-action-check-linked-issues@v1.4.5
        id: check-linked-issues
        with:
          exclude-branches: "release_v**, backport_v**, main, latest-dep-update-**, min-dep-update-**, dependabot/**"
          github-token: ${{ secrets.REPO_SCOPED_TOKEN }}


================================================
FILE: .github/workflows/release.yaml
================================================
on:
  release:
    types: [published]

name: Release
jobs:
  pypi-publish:
    name: PyPI Release
    runs-on: ubuntu-latest
    permissions:
      id-token: write
    steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
    - name: Install deps
      run: |
        python -m pip install --quiet --upgrade pip
        python -m pip install --quiet --upgrade build
        python -m pip install --quiet --upgrade setuptools
    - name: Remove build artifacts and docs
      run: |
        rm -rf .eggs/ dist/ build/ docs/
    - name: Build distribution
      run: python -m build

    - name: Publish package distributions to PyPI
      uses: pypa/gh-action-pypi-publish@release/v1
    - name: Run workflow to create feedstock pull request
      run: |
        gh workflow run create_feedstock_pr.yaml --repo "alteryx/featuretools" -f version=${{ github.event.release.tag_name }}
      env:
        GITHUB_TOKEN: ${{ secrets.REPO_SCOPED_TOKEN }}


================================================
FILE: .github/workflows/release_notes_updated.yaml
================================================
name: Release Notes Updated
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  release_notes_updated:
    name: release notes updated
    runs-on: ubuntu-latest
    steps:
      - name: Check for development branch
        id: branch
        shell: python
        env:
          REF: ${{ github.event.pull_request.head.ref }}
        run: |
          from re import compile
          import os
          main = '^main$'
          release = '^release_v\d+\.\d+\.\d+$'
          backport = '^backport_v\d+\.\d+\.\d+$'
          dep_update = '^latest-dep-update-[a-f0-9]{7}$'
          min_dep_update = '^min-dep-update-[a-f0-9]{7}$'
          regex = main, release, backport, dep_update, min_dep_update
          patterns = list(map(compile, regex))
          ref = os.environ["REF"]
          is_dev = not any(pattern.match(ref) for pattern in patterns)
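          # The ::set-output workflow command is deprecated; write to the GITHUB_OUTPUT file instead.
          with open(os.environ["GITHUB_OUTPUT"], "a") as output_file:
              output_file.write("is_dev=" + str(is_dev).lower() + "\n")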
      - if: ${{ steps.branch.outputs.is_dev == 'true' }}
        name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - if: ${{ steps.branch.outputs.is_dev == 'true' }}
        name: Check if release notes were updated
        run: cat docs/source/release_notes.rst | grep ":pr:\`${{ github.event.number }}\`"
        

================================================
FILE: .github/workflows/test_without_test_dependencies.yaml
================================================
name: Test without Test Dependencies
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
  workflow_dispatch:
jobs:
  use_featuretools_without_test_dependencies:
    name: Test featuretools without Test Dependencies
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
    steps:
      - name: Set up python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Build featuretools and install
        run: |
          make package
          python -m pip install unpacked_sdist/
      - name: Run simple featuretools usage
        run: |
          import featuretools as ft
          es = ft.demo.load_mock_customer(return_entityset=True)
          ft.dfs(
              entityset=es,
              target_dataframe_name="customers",
              agg_primitives=["count"],
              trans_primitives=["month"],
              max_depth=1,
          )
          from featuretools.primitives import IsFreeEmailDomain
          is_free_email_domain = IsFreeEmailDomain()
          is_free_email_domain(['name@gmail.com', 'name@featuretools.com']).tolist()
        shell: python


================================================
FILE: .github/workflows/tests_with_latest_deps.yaml
================================================
name: Tests
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
  workflow_dispatch:
jobs:
  tests:
    name: ${{ matrix.python_version }} unit tests
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python_version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python_version }}
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Build featuretools package
        run: make package
      - name: Set up pip and graphviz
        run: |
          pip config --site set global.progress_bar off
          python -m pip install --upgrade pip
          sudo apt update && sudo apt install -y graphviz
      - name: Install featuretools with test requirements
        run: |
          python -m pip install -e unpacked_sdist/
          python -m pip install -e unpacked_sdist/[test,dask]
      - if: ${{ matrix.python_version == 3.9 }}
        name: Generate coverage args
        run: echo "coverage_args=--cov=featuretools --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml" >> $GITHUB_ENV
      - if: ${{ env.coverage_args }}
        name: Erase coverage files
        run: |
          cd unpacked_sdist
          coverage erase
      - name: Run unit tests
        run: |
          cd unpacked_sdist
          pytest featuretools/ -n auto ${{ env.coverage_args }}
      - if: ${{ env.coverage_args }}
        name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          fail_ci_if_error: true
          files: ${{ github.workspace }}/coverage.xml
          verbose: true


  win_unit_tests:
    name: ${{ matrix.python_version }} windows unit tests
    runs-on: windows-latest
    strategy:
      fail-fast: false
      matrix:
        python_version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - name: Download miniconda
        shell: pwsh
        run: |
          $File = "Miniconda3-latest-Windows-x86_64.exe"
          $Uri = "https://repo.anaconda.com/miniconda/$File"
          $ProgressPreference = "silentlyContinue"
          Invoke-WebRequest -Uri $Uri -Outfile "$env:USERPROFILE/$File"
          $hashFromFile = Get-FileHash "$env:USERPROFILE/$File" -Algorithm SHA256
          $hashFromUrl = "f4d6147b40ea6822255c2dcec8bb0d357c09e230976213f70d7b8c4a10d86bb0"
          if ($hashFromFile.Hash -ne "$hashFromUrl") {
            Throw "$File hashes do not match"
          }
      - name: Install miniconda
        shell: cmd
        run: start /wait "" %UserProfile%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /D=%UserProfile%\Miniconda3
      - name: Create python ${{ matrix.python_version }} environment
        shell: pwsh
        run: |
          . $env:USERPROFILE\Miniconda3\shell\condabin\conda-hook.ps1
          conda create -n featuretools python=${{ matrix.python_version }}
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - name: Install featuretools with test requirements
        shell: pwsh
        run: |
          . $env:USERPROFILE\Miniconda3\shell\condabin\conda-hook.ps1
          conda activate featuretools
          conda config --add channels conda-forge
          conda install -q -y -c conda-forge python-graphviz graphviz
          python -m pip install --upgrade pip
          python -m pip install .[test,dask]
      - name: Run unit tests
        run: |
          . $env:USERPROFILE\Miniconda3\shell\condabin\conda-hook.ps1
          conda activate featuretools
          pytest featuretools\ -n auto


================================================
FILE: .github/workflows/tests_with_minimum_deps.yaml
================================================
name: Tests - Minimum Dependencies
on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches:
      - main
  workflow_dispatch:
jobs:
  py39_tests_minimum_dependencies:
    name: Tests - 3.9 Minimum Dependencies
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python_version: ["3.9"]
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          repository: ${{ github.event.pull_request.head.repo.full_name }}
      - uses: actions/setup-python@v4
        with:
          python-version: 3.9
      - name: Config pip, upgrade pip, and install graphviz
        run: |
          sudo apt update
          sudo apt install -y graphviz
          pip config --site set global.progress_bar off
          python -m pip install --upgrade pip
          python -m pip install wheel
      - name: Install featuretools with no dependencies
        run: |
          python -m pip install -e . --no-dependencies
      - name: Install featuretools - minimum tests dependencies
        run: |
          python -m pip install -r featuretools/tests/requirement_files/minimum_test_requirements.txt
      - name: Install featuretools - minimum core dependencies
        run: |
          python -m pip install -r featuretools/tests/requirement_files/minimum_core_requirements.txt
      - name: Install featuretools - minimum Dask dependencies
        run: |
          python -m pip install -r featuretools/tests/requirement_files/minimum_dask_requirements.txt
      - name: Run unit tests without code coverage
        run: python -m pytest -x -n auto featuretools/tests/

================================================
FILE: .github/workflows/tests_with_woodwork_main_branch.yaml
================================================
name: Tests - Featuretools with Woodwork main branch
on:
  workflow_dispatch:
jobs:
  tests_woodwork_main:
    if: ${{ github.repository_owner == 'alteryx' }}
    name: ${{ matrix.python_version }} tests with Woodwork main
    runs-on: ubuntu-latest
    strategy:
      fail-fast: true
      matrix:
        python_version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python_version }}
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Build featuretools package
        run: make package
      - name: Set up pip and graphviz
        run: |
          pip config --site set global.progress_bar off
          python -m pip install -U pip
          sudo apt update && sudo apt install -y graphviz
      - name: Install Woodwork & Featuretools - test requirements
        run: |
          python -m pip install -e unpacked_sdist/[test,dask]
          python -m pip uninstall -y woodwork
          python -m pip install https://github.com/alteryx/woodwork/archive/main.zip
      - name: Log test run info
        run: |
          echo "Run unit tests without code coverage for ${{ matrix.python_version }}"
          echo "Testing with woodwork version:" `python -c "import woodwork; print(woodwork.__version__)"`
      - name: Run unit tests without code coverage
        run: pytest featuretools/ -n auto

  slack_alert_failure:
    name: Send Slack alert if failure
    needs: tests_woodwork_main
    runs-on: ubuntu-latest
    if: ${{ always() }}
    steps:
      - name: Send Slack alert if failure
        if: ${{ needs.tests_woodwork_main.result != 'success' }}
        id: slack
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "url": "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}


================================================
FILE: .gitignore
================================================
#
docs/source/generated/
docs/source/getting_started/graphs
venv/
data/
installed/
output.csv
htmlcov/
.idea/
featuretools/tests/integration_data/*.csv
featuretools/tests/integration_data/*.gzip
featuretools/tests/integration_data/customers.gzip
featuretools/tests/integration_data/log-0.gzip
featuretools/tests/integration_data/log-1.gzip
featuretools/tests/integration_data/log.gzip
featuretools/tests/integration_data/products.gzip
featuretools/tests/integration_data/regions.gzip
featuretools/tests/integration_data/sessions.gzip
featuretools/tests/integration_data/stores.gzip
**/dask-worker-space/*
*.dirlock
*.~lock*
unpacked_sdist/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
**/.DS_Store
.DS_Store

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

# pickle files
*.p
*.pickle

.pytest_cache

#IDE
.vscode
.devcontainer

*.stats
Dockerfile.arm
.dockerignore


================================================
FILE: .pre-commit-config.yaml
================================================
exclude: |
  (?x)
  .html$|.csv$|.svg$|.md$|.txt$|.json$|.xml$|.pickle$|^.github/|
  (LICENSE.*|README.*)
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.5.0
    hooks:
      - id: nbstripout
        entry: nbstripout
        language: python
        types: [jupyter]
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/MarcoGorelli/absolufy-imports
    rev: v0.3.1
    hooks:
      - id: absolufy-imports
        files: ^featuretools/
  - repo: https://github.com/asottile/add-trailing-comma
    rev: v2.2.3
    hooks:
      - id: add-trailing-comma
        name: Add trailing comma
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: 'v0.3.3'
    hooks:
      - id: ruff
        types_or: [ python, pyi, jupyter ]
        args:
          - --fix
          - --config=./pyproject.toml
      - id: ruff-format
        types_or: [ python, pyi, jupyter ]
        args:
          - --config=./pyproject.toml


================================================
FILE: .readthedocs.yaml
================================================
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/source/conf.py

# Optionally build your docs in additional formats such as PDF and ePub
formats: []

build:
  os: "ubuntu-22.04"
  tools:
    python: "3.9"
  apt_packages:
    - graphviz
    - openjdk-11-jre-headless
  jobs:
    post_build:
      - export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"

python:
  install:
    - method: pip
      path: .
      extra_requirements:
        - docs


================================================
FILE: LICENSE
================================================
BSD 3-Clause License

Copyright (c) 2017, Feature Labs, Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


================================================
FILE: Makefile
================================================
.PHONY: clean
clean:
	find . -name '*.pyo' -delete
	find . -name '*.pyc' -delete
	find . -name __pycache__ -delete
	find . -name '*~' -delete
	find . -name '.coverage.*' -delete

.PHONY: lint
lint:
	python docs/notebook_version_standardizer.py check-execution
	ruff check . --config=./pyproject.toml
	ruff format . --check --config=./pyproject.toml

.PHONY: lint-fix
lint-fix:
	python docs/notebook_version_standardizer.py standardize
	ruff check . --fix --config=./pyproject.toml
	ruff format . --config=./pyproject.toml

.PHONY: test
test:
	python -m pytest featuretools/ -n auto

.PHONY: testcoverage
testcoverage:
	python -m pytest featuretools/ --cov=featuretools -n auto

.PHONY: installdeps
installdeps: upgradepip
	pip install -e .

.PHONY: installdeps-dev
installdeps-dev: upgradepip
	pip install -e ".[dev]"
	pre-commit install

.PHONY: installdeps-test
installdeps-test: upgradepip
	pip install -e ".[test]"

.PHONY: checkdeps
checkdeps:
	$(eval allow_list='holidays|scipy|numpy|pandas|tqdm|cloudpickle|distributed|dask|psutil|woodwork')
	pip freeze | grep -v "alteryx/featuretools.git" | grep -E $(allow_list) > $(OUTPUT_PATH)

.PHONY: upgradepip
upgradepip:
	python -m pip install --upgrade pip

.PHONY: upgradebuild
upgradebuild:
	python -m pip install --upgrade build

.PHONY: upgradesetuptools
upgradesetuptools:
	python -m pip install --upgrade setuptools

.PHONY: package
package: upgradepip upgradebuild upgradesetuptools
	python -m build
	$(eval PACKAGE=$(shell python -c 'import setuptools; setuptools.setup()' --version))
	tar -zxvf "dist/featuretools-${PACKAGE}.tar.gz"
	mv "featuretools-${PACKAGE}" unpacked_sdist


================================================
FILE: README.md
================================================
<p align="center">
<img width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
</p>
<p align="center">
<i>"One of the holy grails of machine learning is to automate more and more of the feature engineering process."</i> ― Pedro Domingos, <a href="https://bit.ly/things_to_know_ml">A Few Useful Things to Know about Machine Learning</a>
</p>

<p align="center">
    <a href="https://github.com/alteryx/featuretools/actions/workflows/tests_with_latest_deps.yaml" alt="Tests" target="_blank">
        <img src="https://github.com/alteryx/featuretools/actions/workflows/tests_with_latest_deps.yaml/badge.svg?branch=main" alt="Tests" />
    </a>
    <a href="https://codecov.io/gh/alteryx/featuretools">
        <img src="https://codecov.io/gh/alteryx/featuretools/branch/main/graph/badge.svg"/>
    </a>
    <a href='https://featuretools.alteryx.com/en/stable/?badge=stable'>
        <img src='https://readthedocs.com/projects/feature-labs-inc-featuretools/badge/?version=stable' alt='Documentation Status' />
    </a>
    <a href="https://badge.fury.io/py/featuretools" target="_blank">
        <img src="https://badge.fury.io/py/featuretools.svg?maxAge=2592000" alt="PyPI Version" />
    </a>
    <a href="https://anaconda.org/conda-forge/featuretools" target="_blank">
        <img src="https://anaconda.org/conda-forge/featuretools/badges/version.svg" alt="Anaconda Version" />
    </a>
    <a href="https://stackoverflow.com/questions/tagged/featuretools" target="_blank">
        <img src="http://img.shields.io/badge/questions-on_stackoverflow-blue.svg" alt="StackOverflow" />
    </a>
    <a href="https://pepy.tech/project/featuretools" target="_blank">
        <img src="https://static.pepy.tech/badge/featuretools/month" alt="PyPI Downloads" />
    </a>
</p>
<hr>

[Featuretools](https://www.featuretools.com) is a Python library for automated feature engineering. See the [documentation](https://docs.featuretools.com) for more information.

## Installation
Install with pip:

```
python -m pip install featuretools
```

or from the conda-forge channel on [conda](https://anaconda.org/conda-forge/featuretools):

```
conda install -c conda-forge featuretools
```

### Add-ons

You can install add-ons individually, or install all of them at once by running:

```
python -m pip install "featuretools[complete]"
```

**Premium Primitives** - Use Premium Primitives from the premium-primitives repo

```
python -m pip install "featuretools[premium]"
```

**NLP Primitives** - Use Natural Language Primitives from the nlp-primitives repo

```
python -m pip install "featuretools[nlp]"
```

**Dask Support** - Use Dask to run DFS with `n_jobs` > 1

```
python -m pip install "featuretools[dask]"
```

## Example
Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.

```python
>>> import featuretools as ft
>>> es = ft.demo.load_mock_customer(return_entityset=True)
>>> es.plot()
```

<img src="https://github.com/alteryx/featuretools/blob/main/docs/source/_static/images/entity_set.png?raw=true" width="350">
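
As a rough, library-free sketch of the kind of stacked aggregation that DFS produces along parent-child relationships: the toy data and variable names below are hypothetical, and this is not how Featuretools is implemented. It only illustrates how a feature named like `SUM(sessions.MIN(transactions.amount))` can be read, by first aggregating transactions up to sessions and then aggregating sessions up to customers.

```python
from collections import defaultdict

# Toy, hypothetical data: each transaction belongs to a session,
# and each session belongs to a customer.
sessions = {1: {"customer_id": 1}, 2: {"customer_id": 1}, 3: {"customer_id": 2}}
transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 30.0},
    {"session_id": 2, "amount": 5.0},
    {"session_id": 3, "amount": 20.0},
]

# Step 1: aggregate transactions up to their parent session (MIN of amount).
amounts_by_session = defaultdict(list)
for transaction in transactions:
    amounts_by_session[transaction["session_id"]].append(transaction["amount"])
min_amount_by_session = {
    session_id: min(amounts) for session_id, amounts in amounts_by_session.items()
}

# Step 2: aggregate the session-level values up to the customer (COUNT and SUM).
features = defaultdict(
    lambda: {"COUNT(sessions)": 0, "SUM(sessions.MIN(transactions.amount))": 0.0}
)
for session_id, session in sessions.items():
    row = features[session["customer_id"]]
    row["COUNT(sessions)"] += 1
    row["SUM(sessions.MIN(transactions.amount))"] += min_amount_by_session[session_id]

# One row of features per customer, keyed by customer id.
for customer_id, row in sorted(features.items()):
    print(customer_id, row)
```

Each nesting level in a feature name corresponds to one hop across a relationship, which is why deeper features reference more than one dataframe.
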

Featuretools can automatically create a single table of features for any "target dataframe":
```python
>>> feature_matrix, features_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
>>> feature_matrix.head(5)
```

```
            zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount) MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id                                                                                                                                                                                                                                  ...
1              60091                  131               10                  10236.77               desktop                      5.60                    149.95             2008                   0.070041               1                   ...                                                     169.77                                 0.610052                                   41.95                               791.976505                              175.939423                                 9.299023                                 -0.377150                                5.857976                                        1                                -0.395358
2              02139                  122                8                   9118.81                mobile                      5.81                    149.15             2008                   0.028647              20                   ...                                                     114.85                                 0.492531                                   42.96                               596.243506                              230.333502                                10.925037                                  0.962350                                7.420480                                        1                                -0.470007
3              02139                   78                5                   5758.24               desktop                      6.78                    147.73             2008                   0.070814              10                   ...                                                      64.98                                 0.645728                                   21.77                               369.770121                              471.048551                                 9.819148                                 -0.244976                               12.537259                                        1                                -0.630425
4              60091                  111                8                   8205.28               desktop                      5.73                    149.56             2008                   0.087986              30                   ...                                                      83.53                                 0.516262                                   17.27                               584.673126                              322.883448                                13.065436                                 -0.548969                               12.738488                                        1                                -0.497169
5              02139                   58                4                   4571.37                tablet                      5.91                    148.17             2008                   0.085883              19                   ...                                                      73.09                                 0.830112                                   27.46                               313.448942                              198.522508                                 8.950528                                  0.098885                                5.599228                                        1                                -0.396571

[5 rows x 69 columns]
```
We now have a feature vector for each customer that can be used for machine learning. See the [documentation on Deep Feature Synthesis](https://featuretools.alteryx.com/en/stable/getting_started/afe.html) for more examples.
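
Column names such as `SUM(sessions.MIN(transactions.amount))` describe stacked aggregations: a value is first computed for each session, then aggregated again for each customer. The sketch below reproduces that idea in plain Python. It is only an illustration, not how Featuretools computes features, and the toy rows and the identifiers `c1` and `c2` are hypothetical rather than taken from the mock customer dataset.

```python
from collections import defaultdict

# Hypothetical toy rows, not taken from the mock customer dataset.
sessions = {1: "c1", 2: "c1", 3: "c2"}  # session_id -> customer_id
transactions = [  # (session_id, amount)
    (1, 10.0),
    (1, 4.0),
    (2, 7.5),
    (3, 20.0),
    (3, 1.0),
]

# Depth 1: MIN(transactions.amount) for each session.
min_per_session = {}
for session_id, amount in transactions:
    current = min_per_session.get(session_id, amount)
    min_per_session[session_id] = min(current, amount)

# Depth 2: SUM(sessions.MIN(transactions.amount)) for each customer.
sum_of_mins = defaultdict(float)
for session_id, customer_id in sessions.items():
    sum_of_mins[customer_id] += min_per_session[session_id]

print(dict(sum_of_mins))
```

Each extra level of nesting in a feature name corresponds to one more pass of this kind across a relationship between dataframes.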

Featuretools contains many different types of built-in primitives for creating features. If the primitive you need is not included, Featuretools also allows you to [define your own custom primitives](https://featuretools.alteryx.com/en/stable/getting_started/primitives.html#defining-custom-primitives).

## Demos
**Predict Next Purchase**

[Repository](https://github.com/alteryx/open_source_demos/blob/main/predict-next-purchase/) | [Notebook](https://github.com/alteryx/open_source_demos/blob/main/predict-next-purchase/Tutorial.ipynb)

In this demonstration, we use a multi-table dataset of 3 million online grocery orders from Instacart to predict what a customer will buy next. We show how to generate features with automated feature engineering and build an accurate machine learning pipeline using Featuretools, which can be reused for multiple prediction problems. For more advanced users, we show how to scale that pipeline to a large dataset using Dask.

For more examples of how to use Featuretools, check out our [demos](https://www.featuretools.com/demos) page.

## Testing & Development

The Featuretools community welcomes pull requests. Instructions for testing and development are available [here.](https://featuretools.alteryx.com/en/stable/install.html#development)

## Support
The Featuretools community is happy to provide support to users of Featuretools. Project support can be found in four places depending on the type of question:

1. For usage questions, use [Stack Overflow](https://stackoverflow.com/questions/tagged/featuretools) with the `featuretools` tag.
2. For bugs, issues, or feature requests, start a [GitHub issue](https://github.com/alteryx/featuretools/issues).
3. For discussion regarding development on the core library, use [Slack](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA).
4. For everything else, the core developers can be reached by email at open_source_support@alteryx.com

## Citing Featuretools

If you use Featuretools, please consider citing the following paper:

James Max Kanter, Kalyan Veeramachaneni. [Deep feature synthesis: Towards automating data science endeavors.](https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf) *IEEE DSAA 2015*.

BibTeX entry:

```bibtex
@inproceedings{kanter2015deep,
  author    = {James Max Kanter and Kalyan Veeramachaneni},
  title     = {Deep feature synthesis: Towards automating data science endeavors},
  booktitle = {2015 {IEEE} International Conference on Data Science and Advanced Analytics, DSAA 2015, Paris, France, October 19-21, 2015},
  pages     = {1--10},
  year      = {2015},
  organization={IEEE}
}
```

## Built at Alteryx

**Featuretools** is an open source project maintained by [Alteryx](https://www.alteryx.com). To see the other open source projects we’re working on, visit [Alteryx Open Source](https://www.alteryx.com/open-source). If building impactful data science pipelines is important to you or your business, please get in touch.

<p align="center">
  <a href="https://www.alteryx.com/open-source">
    <img src="https://alteryx-oss-web-images.s3.amazonaws.com/OpenSource_Logo-01.png" alt="Alteryx Open Source" width="800"/>
  </a>
</p>


================================================
FILE: contributing.md
================================================
# Contributing to Featuretools

:+1::tada: First off, thank you for taking the time to contribute! :tada::+1:

Whether you are a novice or experienced software developer, all contributions and suggestions are welcome!

There are many ways to contribute to Featuretools, with the most common ones being contribution of code or documentation to the project.

**To contribute, you can:**
1. Help users on our [Slack channel](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA). Answer questions under the featuretools tag on [Stack Overflow](https://stackoverflow.com/questions/tagged/featuretools)

2. Submit a pull request for one of our [Good First Issues](https://github.com/alteryx/featuretools/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+First+Issue%22)

3. Make changes to the codebase; see [Contributing to the codebase](#Contributing-to-the-Codebase).

4. Improve our documentation, which can be found under the [docs](docs/) directory or at https://docs.featuretools.com

5. [Report issues](#Report-issues) you're facing, and give a "thumbs up" on issues that others reported and that are relevant to you. Issues should be used for bugs and feature requests only.

6. Spread the word: reference Featuretools from your blog and articles, link to it from your website, or simply star it in GitHub to say "I use it".
    * If you would like to be featured on the [ecosystem page](https://featuretools.alteryx.com/en/stable/resources/ecosystem.html), you can submit a [pull request](https://github.com/alteryx/featuretools).

## Contributing to the Codebase

Before starting major work, you should touch base with the maintainers of Featuretools by filing an issue on GitHub or posting a message in the [#development channel on Slack](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA). This will increase the likelihood your pull request will eventually get merged in.

#### 1. Fork and clone repo
* The code is hosted on GitHub, so you will need to use Git to fork the project and make changes to the codebase. To start, go to the [Featuretools GitHub page](https://github.com/alteryx/featuretools) and click the `Fork` button.
* After you have created the fork, you will want to clone the fork to your machine and connect your version of the project to the upstream Featuretools repo.
  ```bash
  git clone https://github.com/your-user-name/featuretools.git
  cd featuretools
  git remote add upstream https://github.com/alteryx/featuretools
  ```
* Once you have obtained a copy of the code, you should create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. You can run the following steps to create a separate virtual environment, and install Featuretools in editable mode.
  ```bash
  python -m venv venv
  source venv/bin/activate
  make installdeps
  git checkout -b issue####-branch_name
  ```

* You will need to install GraphViz and Pandoc to run all unit tests and build the docs:

  > Pandoc is only needed to build the documentation locally.

     **macOS (Intel)** (use [Homebrew](https://brew.sh/)):
     ```console
     brew install graphviz pandoc
     ```

     **macOS (M1)** (use [Homebrew](https://brew.sh/)):
     ```console
     brew install graphviz pandoc
     ```

     **Ubuntu**:
     ```console
     sudo apt install graphviz pandoc -y
     ```

#### 2. Implement your Pull Request

* Implement your pull request. If needed, add new tests or update the documentation.
* Before submitting to GitHub, verify that the tests run and the code lints properly:
  ```bash
  # runs linting
  make lint

  # will fix some common linting issues automatically
  make lint-fix

  # runs tests
  make test
  ```
* If you made changes to the documentation, build the documentation locally.
  ```bash
  # go to docs and build
  cd docs
  make html

  # view docs locally
  open build/html/index.html
  ```
* Before you commit, a few lint-fixing hooks will run. You can also run these manually.
  ```bash
  # run linting hooks only on changed files
  pre-commit run

  # run linting hooks on all files
  pre-commit run --all-files
  ```

#### 3. Submit your Pull Request

* Once your changes are ready to be submitted, make sure to push your changes to GitHub before creating a pull request.
* If you need to update your code with the latest changes from the main Featuretools repo, run the commands below to merge the latest changes from the Featuretools `main` branch into your current local branch. You may need to resolve merge conflicts between your changes and the upstream changes. After the merge, push the updates to your forked repo.
  ```bash
  git fetch upstream
  git merge upstream/main
  ```
* Create a pull request to merge the changes from your forked repo branch into the Featuretools `main` branch. Creating the pull request will automatically run our continuous integration.
* If this is your first contribution, you will need to sign the Contributor License Agreement as directed.
* Update the "Future Release" section of the release notes (`docs/source/release_notes.rst`) to include your pull request and add your GitHub username to the list of contributors. Add a description of your PR to the subsection that most closely matches your contribution:
    * Enhancements: new features or additions to Featuretools.
    * Fixes: things like bugfixes or adding more descriptive error messages.
    * Changes: modifications to an existing part of Featuretools.
    * Documentation Changes
    * Testing Changes

   Documentation or testing changes rarely warrant an individual release notes entry; the PR number can be added to their respective "Miscellaneous changes" entries.
* We will review your changes, and you will most likely be asked to make additional changes before your pull request is ready to merge. Once it has been reviewed by a maintainer of Featuretools and passes continuous integration, we will merge it, and you will have successfully contributed to Featuretools!

## Report issues
When reporting issues, please include as much detail as possible about your operating system, Featuretools version, and Python version. Whenever possible, please also include a brief, self-contained code example that demonstrates the problem.
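
One way to gather these details is a short script like the one below. This is a minimal sketch using only the Python standard library; the function name `environment_report` and the dictionary keys are hypothetical, and the script reports "not installed" if the `featuretools` package cannot be found in the current environment.

```python
import platform
import sys
from importlib import metadata


def environment_report() -> dict:
    """Collect the environment details that are useful in an issue report."""
    try:
        ft_version = metadata.version("featuretools")
    except metadata.PackageNotFoundError:
        # Featuretools is not installed in this environment.
        ft_version = "not installed"
    return {
        "operating_system": platform.platform(),
        "python_version": sys.version.split()[0],
        "featuretools_version": ft_version,
    }


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

The printed lines can be pasted directly into the issue description.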


================================================
FILE: docs/Makefile
================================================
# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
PAPER         =
BUILDDIR      = build
GENDIR        = source/generated

# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
	$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don\'t have Sphinx installed, grab it from http://sphinx-doc.org/)
endif

# Internal variables.
PAPEROPT_a4     = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source

.PHONY: help
help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo "  html       to make standalone HTML files"
	@echo "  dirhtml    to make HTML files named index.html in directories"
	@echo "  singlehtml to make a single large HTML file"
	@echo "  pickle     to make pickle files"
	@echo "  json       to make JSON files"
	@echo "  htmlhelp   to make HTML files and a HTML help project"
	@echo "  qthelp     to make HTML files and a qthelp project"
	@echo "  applehelp  to make an Apple Help Book"
	@echo "  devhelp    to make HTML files and a Devhelp project"
	@echo "  epub       to make an epub"
	@echo "  epub3      to make an epub3"
	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
	@echo "  latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
	@echo "  text       to make text files"
	@echo "  man        to make manual pages"
	@echo "  texinfo    to make Texinfo files"
	@echo "  info       to make Texinfo files and run them through makeinfo"
	@echo "  gettext    to make PO message catalogs"
	@echo "  changes    to make an overview of all changed/added/deprecated items"
	@echo "  xml        to make Docutils-native XML files"
	@echo "  pseudoxml  to make pseudoxml-XML files for display purposes"
	@echo "  linkcheck  to check all external links for integrity"
	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"
	@echo "  coverage   to run coverage check of the documentation (if enabled)"
	@echo "  dummy      to check syntax errors of document sources"

.PHONY: clean
clean:
	rm -rf $(BUILDDIR)/*
	rm -rf $(GENDIR)/*

.PHONY: html
html:
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html $(SPHINXOPTS)
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

.PHONY: dirhtml
dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."

.PHONY: singlehtml
singlehtml:
	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
	@echo
	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."

.PHONY: pickle
pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."

.PHONY: json
json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
	@echo
	@echo "Build finished; now you can process the JSON files."

.PHONY: htmlhelp
htmlhelp:
	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
	@echo
	@echo "Build finished; now you can run HTML Help Workshop with the" \
	      ".hhp project file in $(BUILDDIR)/htmlhelp."

.PHONY: qthelp
qthelp:
	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
	@echo
	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/featuretools.qhcp"
	@echo "To view the help file:"
	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/featuretools.qhc"

.PHONY: applehelp
applehelp:
	$(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp
	@echo
	@echo "Build finished. The help book is in $(BUILDDIR)/applehelp."
	@echo "N.B. You won't be able to view it unless you put it in" \
	      "~/Library/Documentation/Help or install it in your application" \
	      "bundle."

.PHONY: devhelp
devhelp:
	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
	@echo
	@echo "Build finished."
	@echo "To view the help file:"
	@echo "# mkdir -p $$HOME/.local/share/devhelp/featuretools"
	@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/featuretools"
	@echo "# devhelp"

.PHONY: epub
epub:
	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
	@echo
	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."

.PHONY: epub3
epub3:
	$(SPHINXBUILD) -b epub3 $(ALLSPHINXOPTS) $(BUILDDIR)/epub3
	@echo
	@echo "Build finished. The epub3 file is in $(BUILDDIR)/epub3."

.PHONY: latex
latex:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo
	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
	@echo "Run \`make' in that directory to run these through (pdf)latex" \
	      "(use \`make latexpdf' here to do that automatically)."

.PHONY: latexpdf
latexpdf:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through pdflatex..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

.PHONY: latexpdfja
latexpdfja:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through platex and dvipdfmx..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

.PHONY: text
text:
	$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
	@echo
	@echo "Build finished. The text files are in $(BUILDDIR)/text."

.PHONY: man
man:
	$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
	@echo
	@echo "Build finished. The manual pages are in $(BUILDDIR)/man."

.PHONY: texinfo
texinfo:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo
	@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
	@echo "Run \`make' in that directory to run these through makeinfo" \
	      "(use \`make info' here to do that automatically)."

.PHONY: info
info:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo "Running Texinfo files through makeinfo..."
	$(MAKE) -C $(BUILDDIR)/texinfo info
	@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."

.PHONY: gettext
gettext:
	$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
	@echo
	@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."

.PHONY: changes
changes:
	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
	@echo
	@echo "The overview file is in $(BUILDDIR)/changes."

.PHONY: linkcheck
linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in $(BUILDDIR)/linkcheck/output.txt."

.PHONY: doctest
doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in $(BUILDDIR)/doctest/output.txt."

.PHONY: coverage
coverage:
	$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage
	@echo "Testing of coverage in the sources finished, look at the " \
	      "results in $(BUILDDIR)/coverage/python.txt."

.PHONY: xml
xml:
	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
	@echo
	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."

.PHONY: pseudoxml
pseudoxml:
	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
	@echo
	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."

.PHONY: dummy
dummy:
	$(SPHINXBUILD) -b dummy $(ALLSPHINXOPTS) $(BUILDDIR)/dummy
	@echo
	@echo "Build finished. Dummy builder generates no files."


================================================
FILE: docs/backport_release.md
================================================
# Backport Release Process

In situations where we need to backport commits to earlier versions of our software, we'll need to perform the release process slightly differently than a normal release.

<p align="center">
<img width=60% src="source/_static/images/backport_release.png" alt="Backport Release" />
</p>

This document outlines the differences between a normal release and a backport release. It uses the same outline as the [Release Guide](../release.md).

## 0. Pre-Release Checklist

Before starting the backport release process, verify the following:

- Get agreement on the latest commit to use for targeting the release. A backport release will be targeted on some commit other than the latest on main. Many times the new target will be an old release, which will have a tag that can be referenced--for example `v0.11.1`.
- Get agreement on the commits to port over for the backport release.
- Get agreement on the version number to use for the backport release.

#### Version Numbering for Backport Releases

Featuretools uses [semantic versioning](https://semver.org/). Every release has a major, minor, and patch version number, displayed as `<majorVersion>.<minorVersion>.<patchVersion>`. **A backport release will increment the patch version.**

This may be an intermediate number between two preexisting releases--for example a new `0.11.2` to be added between existing `0.11.1` and `0.12.0` releases. It can also be a new latest release--so `0.12.1` in the same situation--using only some of the commits that are present in the Future Release section of the release notes.
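
The patch increment described above can be sketched as follows. This is an illustrative helper, not part of the Featuretools release tooling; the function name `next_patch_version` is hypothetical, and it assumes tags of the form `vX.Y.Z` or `X.Y.Z`.

```python
import re


def next_patch_version(tag: str) -> str:
    """Return the version that follows `tag` with only the patch number incremented."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"not a major.minor.patch version: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"{major}.{minor}.{patch + 1}"


print(next_patch_version("v0.11.1"))  # intermediate release between 0.11.1 and 0.12.0
print(next_patch_version("0.12.0"))   # new latest release
```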

## 0.5. Create target branch for backport release

#### Checkout intended target commit

1. Check out the agreed-upon latest commit for targeting the release. If this is a previous release, you may check out its tag with `git checkout v0.11.1`.

#### Create backport branch

1. Branch off of the target commit. For the branch name, please use the most recent major and minor versions to this commit (in this example `0` and `11` respectively), leaving the patch number as an `x`. This means that we would create `0.11.x` in the working example. This is necessary so that if any further backport releases are needed, we could continue to use this branch as the target. This branch is to be treated as `main` is treated in a normal release. It will be the target for our release.

This branch will be automatically protected (unless the version exceeds 9.Y.x or X.99.x, in which case contact the repo team about expanding the protection rules) to avoid unintended commits from making their way into the release undetected.
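
As a rough sketch, the branch name can be derived from the target tag, and the protection limits can be checked at the same time. Both function names below are hypothetical, and the check reflects one reading of the note above (major version at most 9, minor version at most 99), not the repository's actual protection rules.

```python
import re


def backport_branch(tag: str) -> str:
    """Build the backport branch name (for example `0.11.x`) from a release tag."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.\d+", tag)
    if match is None:
        raise ValueError(f"not a major.minor.patch version: {tag!r}")
    major, minor = match.groups()
    return f"{major}.{minor}.x"


def within_protection_rules(branch: str) -> bool:
    """Assumed limits: major version at most 9 and minor version at most 99."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.x", branch)
    if match is None:
        return False
    major, minor = (int(part) for part in match.groups())
    return major <= 9 and minor <= 99


branch = backport_branch("v0.11.1")
print(branch, within_protection_rules(branch))
```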

#### Port over desired commits

1. Create a feature branch off the backport branch. For the branch name, please use "backport_vX.Y.Z" as the naming scheme (e.g. "backport_v0.11.2"). Doing so will bypass our release notes checkin test which requires all other PRs to add a release note entry.
2. Cherry-pick the desired commits onto `backport_v0.11.2`.
3. Create a pull request with the backport `0.11.x` branch as its target, get confirmation that the desired changes were added, and confirm that the CI checks pass.
4. Under the "Future Release" section in the release notes, include the ported over commits' release notes (don't remove them from their original location back on `main`), indicating that they are a backport of the original PR.

   ```
   Future Release
   ==============
       * Enhancements
       * Fixes
           * Fix bug (backport of :pr:`1110`)
       * Changes
       * Documentation Changes
       * Testing Changes

   Thanks to the following people for contributing to this release:
   ```

5. Merge the PR into the `0.11.x` backport branch.

## 1. Create Featuretools Backport release on GitHub

With our backport branch `0.11.x` as our target, we now proceed with the release of `0.11.2`.

#### Create release branch

1. **Branch off of the backport branch `0.11.x`.** For the branch name, please use "release_vX.Y.Z" as the naming scheme (e.g. "release_v0.11.2"). Doing so will bypass our release notes checkin test which requires all other PRs to add a release note entry.

#### Bump version number

1. Bump `__version__` in `setup.py`, `featuretools/version.py`, and `featuretools/tests/test_version.py`.

#### Update Release Notes

1. Replace **"Future Release"** in `docs/source/release_notes.rst` with the version number and the current date

   ```
   v0.11.2 Sep 28, 2020
   ====================
   ```

2. Remove any unused Release Notes sections for this release (e.g. Fixes, Testing Changes)
3. Add yourself to the list of contributors to this release and **put the contributors in alphabetical order**
4. The release PR does not need to be mentioned in the list of changes
5. Add a commented out "Future Release" section with all of the Release Notes sections above the current section

   ```
   .. Future Release
     ==============
       * Enhancements
       * Fixes
       * Changes
       * Documentation Changes
       * Testing Changes

   .. Thanks to the following people for contributing to this release:
   ```
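
The release heading from step 1 can be generated with a small helper so that the underline always matches the title length. This is only a sketch: the function name `release_header` and the `MONTHS` tuple are hypothetical and not part of the Featuretools tooling.

```python
from datetime import date

# Month abbreviations are spelled out so the result does not depend on the locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def release_header(version: str, release_date: date) -> str:
    """Build the heading and an underline of matching length for the release notes."""
    month = MONTHS[release_date.month - 1]
    title = f"v{version} {month} {release_date.day}, {release_date.year}"
    return f"{title}\n{'=' * len(title)}"


print(release_header("0.11.2", date(2020, 9, 28)))
```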

#### Create Release PR

A [release PR](https://github.com/alteryx/featuretools/pull/1915) should have the version number as the title and the release notes for that release as the PR body text. The contributors list is not necessary. The special Sphinx docs syntax (:pr:\`547\`) needs to be changed to GitHub link syntax (#547).

Checklist before merging:

- All tests are currently green on checkin and on `0.11.x`.
- The ReadtheDocs build for the release PR branch has passed, and the resulting docs contain the expected release notes.
- PR has been reviewed and approved.
- Confirm with the team that `0.11.x` will be frozen until step 2 (GitHub Release) is complete.

## 2. Create GitHub Release

After the release pull request has been merged into the `0.11.x` branch, it is time to draft the GitHub release. [Example release](https://github.com/alteryx/featuretools/releases/tag/v1.6.0)

- **The target should be the `0.11.x` backport branch**
- The tag should be the version number with a v prefix (e.g. v0.11.2)
- Release title is the same as the tag
- Release description should be the full Release Notes updates for the release, including the line thanking contributors. Contributors should also have their links changed from the docs syntax (:user:\`gsheni\`) to GitHub syntax (@gsheni)
- This is not a pre-release
- Publishing the release will automatically upload the package to PyPI

Note that this backported release will show up on the repository's front page as the latest release even if there is technically a later `0.12.0` release.
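
The syntax changes needed for the release PR body and the release description (`:pr:` roles to `#547` style links, `:user:` roles to `@gsheni` style mentions) can be applied with two regular expressions. This is a minimal sketch; the function name `to_github_syntax` is hypothetical, and it assumes the roles appear exactly in the forms shown in this document.

```python
import re


def to_github_syntax(release_notes: str) -> str:
    """Convert Sphinx :pr: and :user: roles to GitHub link and mention syntax."""
    text = re.sub(r":pr:`(\d+)`", r"#\1", release_notes)
    text = re.sub(r":user:`([^`]+)`", r"@\1", text)
    return text


print(to_github_syntax("* Fix bug (backport of :pr:`1110`), thanks :user:`gsheni`"))
```

Any other text in the release notes is left unchanged, so the output should still be reviewed by hand before publishing.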

## Release on conda-forge

If a later release exists, conda-forge will not automatically create a new PR in [conda-forge/featuretools-feedstock](https://github.com/conda-forge/featuretools-feedstock/pulls). Instead a PR will need to be manually created. You can do either of the following:

- Branch off of the 0.11.1 meta.yaml update commit for the 0.11.2 meta.yaml changes. This is "cleaner" and sometimes easier, but if migration files (like py310) have been added between 0.11.1 and 0.12.0 you will have to add them in and re-render yourself.
- Tack the 0.11.2 changes on after the 0.12.0 update commit in the feedstock repo. This means that if any of the boilerplate has changed, you do not have to manually re-add it yourself. An example of this can be seen from a Woodwork backport release [here](https://github.com/conda-forge/woodwork-feedstock/pull/32).

Once the PR is created:

1. Update `recipe/meta.yaml`: you may need to handle the version, source links, and SHA256 if you had to open the PR yourself, and you will also need to update the requirements.
2. After tests pass, a maintainer will merge the PR.


================================================
FILE: docs/make.bat
================================================
@ECHO OFF

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set BUILDDIR=build
set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source
set I18NSPHINXOPTS=%SPHINXOPTS% source
if NOT "%PAPER%" == "" (
	set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
	set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%
)

if "%1" == "" goto help

if "%1" == "help" (
	:help
	echo.Please use `make ^<target^>` where ^<target^> is one of
	echo.  html       to make standalone HTML files
	echo.  dirhtml    to make HTML files named index.html in directories
	echo.  singlehtml to make a single large HTML file
	echo.  pickle     to make pickle files
	echo.  json       to make JSON files
	echo.  htmlhelp   to make HTML files and a HTML help project
	echo.  qthelp     to make HTML files and a qthelp project
	echo.  devhelp    to make HTML files and a Devhelp project
	echo.  epub       to make an epub
	echo.  epub3      to make an epub3
	echo.  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter
	echo.  text       to make text files
	echo.  man        to make manual pages
	echo.  texinfo    to make Texinfo files
	echo.  gettext    to make PO message catalogs
	echo.  changes    to make an overview over all changed/added/deprecated items
	echo.  xml        to make Docutils-native XML files
	echo.  pseudoxml  to make pseudoxml-XML files for display purposes
	echo.  linkcheck  to check all external links for integrity
	echo.  doctest    to run all doctests embedded in the documentation if enabled
	echo.  coverage   to run coverage check of the documentation if enabled
	echo.  dummy      to check syntax errors of document sources
	goto end
)

if "%1" == "clean" (
	for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
	del /q /s %BUILDDIR%\*
	goto end
)


REM Check if sphinx-build is available and fallback to Python version if any
%SPHINXBUILD% 1>NUL 2>NUL
if errorlevel 9009 goto sphinx_python
goto sphinx_ok

:sphinx_python

set SPHINXBUILD=python -m sphinx.__init__
%SPHINXBUILD% 2> nul
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

:sphinx_ok


if "%1" == "html" (
	%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The HTML pages are in %BUILDDIR%/html.
	goto end
)

if "%1" == "dirhtml" (
	%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
	goto end
)

if "%1" == "singlehtml" (
	%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
	goto end
)

if "%1" == "pickle" (
	%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can process the pickle files.
	goto end
)

if "%1" == "json" (
	%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can process the JSON files.
	goto end
)

if "%1" == "htmlhelp" (
	%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can run HTML Help Workshop with the ^
.hhp project file in %BUILDDIR%/htmlhelp.
	goto end
)

if "%1" == "qthelp" (
	%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; now you can run "qcollectiongenerator" with the ^
.qhcp project file in %BUILDDIR%/qthelp, like this:
	echo.^> qcollectiongenerator %BUILDDIR%\qthelp\featuretools.qhcp
	echo.To view the help file:
	echo.^> assistant -collectionFile %BUILDDIR%\qthelp\featuretools.qhc
	goto end
)

if "%1" == "devhelp" (
	%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished.
	goto end
)

if "%1" == "epub" (
	%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The epub file is in %BUILDDIR%/epub.
	goto end
)

if "%1" == "epub3" (
	%SPHINXBUILD% -b epub3 %ALLSPHINXOPTS% %BUILDDIR%/epub3
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The epub3 file is in %BUILDDIR%/epub3.
	goto end
)

if "%1" == "latex" (
	%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
	goto end
)

if "%1" == "latexpdf" (
	%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
	cd %BUILDDIR%/latex
	make all-pdf
	cd %~dp0
	echo.
	echo.Build finished; the PDF files are in %BUILDDIR%/latex.
	goto end
)

if "%1" == "latexpdfja" (
	%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
	cd %BUILDDIR%/latex
	make all-pdf-ja
	cd %~dp0
	echo.
	echo.Build finished; the PDF files are in %BUILDDIR%/latex.
	goto end
)

if "%1" == "text" (
	%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The text files are in %BUILDDIR%/text.
	goto end
)

if "%1" == "man" (
	%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The manual pages are in %BUILDDIR%/man.
	goto end
)

if "%1" == "texinfo" (
	%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
	goto end
)

if "%1" == "gettext" (
	%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
	goto end
)

if "%1" == "changes" (
	%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
	if errorlevel 1 exit /b 1
	echo.
	echo.The overview file is in %BUILDDIR%/changes.
	goto end
)

if "%1" == "linkcheck" (
	%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
	if errorlevel 1 exit /b 1
	echo.
	echo.Link check complete; look for any errors in the above output ^
or in %BUILDDIR%/linkcheck/output.txt.
	goto end
)

if "%1" == "doctest" (
	%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
	if errorlevel 1 exit /b 1
	echo.
	echo.Testing of doctests in the sources finished, look at the ^
results in %BUILDDIR%/doctest/output.txt.
	goto end
)

if "%1" == "coverage" (
	%SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage
	if errorlevel 1 exit /b 1
	echo.
	echo.Testing of coverage in the sources finished, look at the ^
results in %BUILDDIR%/coverage/python.txt.
	goto end
)

if "%1" == "xml" (
	%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The XML files are in %BUILDDIR%/xml.
	goto end
)

if "%1" == "pseudoxml" (
	%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.
	goto end
)

if "%1" == "dummy" (
	%SPHINXBUILD% -b dummy %ALLSPHINXOPTS% %BUILDDIR%/dummy
	if errorlevel 1 exit /b 1
	echo.
	echo.Build finished. Dummy builder generates no files.
	goto end
)

:end


================================================
FILE: docs/notebook_version_standardizer.py
================================================
import json
import os

import click

DOCS_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "source")


def _get_ipython_notebooks(docs_source):
    directories_to_skip = ["_templates", "generated", ".ipynb_checkpoints"]
    notebooks = []
    for root, _, filenames in os.walk(docs_source):
        if any(dir_ in root for dir_ in directories_to_skip):
            continue
        for filename in filenames:
            if filename.endswith(".ipynb"):
                notebooks.append(os.path.join(root, filename))
    return notebooks


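def _check_delete_empty_cell(notebook, delete=True):
    """Check for, and optionally delete, an empty trailing code cell.

    With ``delete=False``, return False if the last cell is an empty code
    cell and True otherwise. With ``delete=True``, drop that cell if present,
    rewrite the notebook in place, and return None.
    """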
    with open(notebook, "r") as f:
        source = json.load(f)
    cell = source["cells"][-1]
    if cell["cell_type"] == "code" and cell["source"] == []:
        # this is an empty cell, which we should delete
        if delete:
            source["cells"] = source["cells"][:-1]
        else:
            return False
    if delete:
        with open(notebook, "w") as f:
            json.dump(source, f, ensure_ascii=False, indent=1)
    else:
        return True


def _check_execution_and_output(notebook):
    with open(notebook, "r") as f:
        source = json.load(f)
    for cell in source["cells"]:
        if cell["cell_type"] == "code" and (
            cell["execution_count"] is not None or cell["outputs"] != []
        ):
            return False
    return True


def _check_python_version(notebook, default_version):
    with open(notebook, "r") as f:
        source = json.load(f)
    if source["metadata"]["language_info"]["version"] != default_version:
        return False
    return True


def _fix_python_version(notebook, default_version):
    with open(notebook, "r") as f:
        source = json.load(f)
    source["metadata"]["language_info"]["version"] = default_version
    with open(notebook, "w") as f:
        json.dump(source, f, ensure_ascii=False, indent=1)


def _fix_execution_and_output(notebook):
    with open(notebook, "r") as f:
        source = json.load(f)
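    for cell in source["cells"]:
        # Use the same condition as _check_execution_and_output so that cells
        # with leftover outputs but no execution count are also cleared.
        if cell["cell_type"] == "code" and (
            cell["execution_count"] is not None or cell["outputs"] != []
        ):
            cell["execution_count"] = None
            cell["outputs"] = []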
    source["metadata"]["kernelspec"]["display_name"] = "Python 3"
    source["metadata"]["kernelspec"]["name"] = "python3"
    with open(notebook, "w") as f:
        json.dump(source, f, ensure_ascii=False, indent=1)


def _get_notebooks_with_executions_and_empty(notebooks, default_version="3.9.2"):
    executed = []
    empty_last_cell = []
    versions = []
    for notebook in notebooks:
        if not _check_execution_and_output(notebook):
            executed.append(notebook)
        if not _check_delete_empty_cell(notebook, delete=False):
            empty_last_cell.append(notebook)
        if not _check_python_version(notebook, default_version):
            versions.append(notebook)
    return (executed, empty_last_cell, versions)


def _fix_versions(notebooks, default_version="3.9.2"):
    for notebook in notebooks:
        _fix_python_version(notebook, default_version)


def _remove_notebook_empty_last_cell(notebooks):
    for notebook in notebooks:
        _check_delete_empty_cell(notebook, delete=True)


def _standardize_outputs(notebooks):
    for notebook in notebooks:
        _fix_execution_and_output(notebook)


@click.group()
def cli():
    """no-op"""


@cli.command()
def standardize():
    notebooks = _get_ipython_notebooks(DOCS_PATH)
    (
        executed_notebooks,
        empty_cells,
        versions,
    ) = _get_notebooks_with_executions_and_empty(notebooks)
    if executed_notebooks:
        _standardize_outputs(executed_notebooks)
        executed_notebooks = ["\t" + notebook for notebook in executed_notebooks]
        executed_notebooks = "\n".join(executed_notebooks)
        click.echo(f"Removed the outputs for:\n {executed_notebooks}")
    if empty_cells:
        _remove_notebook_empty_last_cell(empty_cells)
        empty_cells = ["\t" + notebook for notebook in empty_cells]
        empty_cells = "\n".join(empty_cells)
        click.echo(f"Removed the empty cells for:\n {empty_cells}")
    if versions:
        _fix_versions(versions)
        versions = ["\t" + notebook for notebook in versions]
        versions = "\n".join(versions)
        click.echo(f"Fixed python versions for:\n {versions}")


@cli.command()
def check_execution():
    notebooks = _get_ipython_notebooks(DOCS_PATH)
    (
        executed_notebooks,
        empty_cells,
        versions,
    ) = _get_notebooks_with_executions_and_empty(notebooks)
    if executed_notebooks:
        executed_notebooks = ["\t" + notebook for notebook in executed_notebooks]
        executed_notebooks = "\n".join(executed_notebooks)
        raise SystemExit(
            f"The following notebooks have executed outputs:\n {executed_notebooks}\n"
            "Please run make lint-fix to fix this.",
        )
    if empty_cells:
        empty_cells = ["\t" + notebook for notebook in empty_cells]
        empty_cells = "\n".join(empty_cells)
        raise SystemExit(
            f"The following notebooks have empty cells at the end:\n {empty_cells}\n"
            "Please run make lint-fix to fix this.",
        )
    if versions:
        versions = ["\t" + notebook for notebook in versions]
        versions = "\n".join(versions)
        raise SystemExit(
            f"The following notebooks have the wrong Python version: \n {versions}\n"
            "Please run make lint-fix to fix this.",
        )


if __name__ == "__main__":
    cli()


================================================
FILE: docs/pull_request_template.md
================================================
### Pull Request Description
(replace this text with your description)

-----
*After creating the pull request: in order to pass the **release_notes_updated** check you will need to update the "Future Release" section of* `docs/source/release_notes.rst` *to include this pull request.*


================================================
FILE: docs/source/_static/style.css
================================================
.footer {
    background-color: #0D2345;
    padding-bottom: 40px;
    padding-top: 40px;
    width: 100%;
}

.footer-cell-1 {
    grid-row: 1;
    grid-column: 1 / 3;
}

.footer-cell-2 {
    grid-row: 1;
    grid-column: 4;
    margin-bottom: 15px;
    text-align: right;
}

.footer-cell-3 {
    grid-row: 2;
    grid-column: 1 / 5;
}

.footer-cell-4 {
    grid-row: 3;
    grid-column: 1 / 3;
}

.footer-container {
    display: grid;
    margin-left: 10%;
    margin-right: 10%;
}

.footer-image-alteryx {
    padding-top: 22px;
    width: 270px;
}

.footer-image-copyright {
    width: 180px;
}

.footer-image-github {
    width: 50px;
}

.footer-image-twitter {
    width: 60px;
}

.footer-line {
    border-top: 2px solid white;
    margin-left: 7px;
    margin-right: 15px;
}


================================================
FILE: docs/source/api_reference.rst
================================================
.. _api_ref:

API Reference
=============

.. currentmodule:: featuretools

Demo Datasets
~~~~~~~~~~~~~
.. currentmodule:: featuretools.demo


.. autosummary::
    :toctree: generated/

    load_retail
    load_mock_customer
    load_flight
    load_weather

Deep Feature Synthesis
~~~~~~~~~~~~~~~~~~~~~~
.. currentmodule:: featuretools

.. autosummary::
    :toctree: generated/

    dfs
    get_valid_primitives

Timedelta
~~~~~~~~~
.. currentmodule:: featuretools

.. autosummary::
    :toctree: generated/

    Timedelta

Time utils
~~~~~~~~~~
.. currentmodule:: featuretools

.. autosummary::
    :toctree: generated/

    make_temporal_cutoffs


Feature Primitives
~~~~~~~~~~~~~~~~~~

Primitive Types
---------------
.. currentmodule:: featuretools.primitives

.. autosummary::
    :toctree: generated/

    TransformPrimitive
    AggregationPrimitive


.. _api_ref.aggregation_features:

Aggregation Primitives
----------------------
.. autosummary::
    :toctree: generated/

    All
    Any
    AverageCountPerUnique
    AvgTimeBetween
    Count
    CountAboveMean
    CountBelowMean
    CountGreaterThan
    CountInsideNthSTD
    CountInsideRange
    CountLessThan
    CountOutsideNthSTD
    CountOutsideRange
    DateFirstEvent
    Entropy
    First
    FirstLastTimeDelta
    HasNoDuplicates
    IsMonotonicallyDecreasing
    IsMonotonicallyIncreasing
    IsUnique
    Kurtosis
    Last
    Max
    MaxConsecutiveFalse
    MaxConsecutiveNegatives
    MaxConsecutivePositives
    MaxConsecutiveTrue
    MaxConsecutiveZeros
    MaxCount
    MaxMinDelta
    Mean
    Median
    MedianCount
    Min
    MinCount
    Mode
    NMostCommon
    NMostCommonFrequency
    NUniqueDays
    NUniqueDaysOfCalendarYear
    NUniqueMonths
    NUniqueWeeks
    NumConsecutiveGreaterMean
    NumConsecutiveLessMean
    NumFalseSinceLastTrue
    NumPeaks
    NumTrue
    NumTrueSinceLastFalse
    NumUnique
    NumZeroCrossings
    PercentTrue
    PercentUnique
    Skew
    Std
    Sum
    TimeSinceFirst
    TimeSinceLast
    TimeSinceLastFalse
    TimeSinceLastMax
    TimeSinceLastMin
    TimeSinceLastTrue
    Trend
    Variance

Transform Primitives
--------------------
Binary Transform Primitives
***************************
.. autosummary::
    :toctree: generated/

    AddNumeric
    AddNumericScalar
    DivideByFeature
    DivideNumeric
    DivideNumericScalar
    Equal
    EqualScalar
    GreaterThan
    GreaterThanEqualTo
    GreaterThanEqualToScalar
    GreaterThanScalar
    LessThan
    LessThanEqualTo
    LessThanEqualToScalar
    LessThanScalar
    ModuloByFeature
    ModuloNumeric
    ModuloNumericScalar
    MultiplyBoolean
    MultiplyNumeric
    MultiplyNumericBoolean
    MultiplyNumericScalar
    NotEqual
    NotEqualScalar
    ScalarSubtractNumericFeature
    SubtractNumeric
    SubtractNumericScalar


Combine features
****************
.. autosummary::
    :toctree: generated/

    IsIn
    And
    Or
    Not


.. _api_ref.cumulative_features:

Cumulative Transform Primitives
*******************************
.. autosummary::
    :toctree: generated/

    Diff
    DiffDatetime
    TimeSincePrevious
    CumCount
    CumSum
    CumMean
    CumMin
    CumMax
    CumulativeTimeSinceLastFalse
    CumulativeTimeSinceLastTrue


Datetime Transform Primitives
*****************************
.. autosummary::
    :toctree: generated/

    Age
    DateToHoliday
    DateToTimeZone
    Day
    DayOfYear
    DaysInMonth
    DistanceToHoliday
    Hour
    IsFederalHoliday
    IsFirstWeekOfMonth
    IsLeapYear
    IsLunchTime
    IsMonthEnd
    IsMonthStart
    IsQuarterEnd
    IsQuarterStart
    IsWeekend
    IsWorkingHours
    IsYearEnd
    IsYearStart
    Minute
    Month
    NthWeekOfMonth
    PartOfDay
    Quarter
    Season
    Second
    TimeSince
    Week
    Weekday
    Year


Email, URL and File Transform Primitives
****************************************
.. autosummary::
    :toctree: generated/

    EmailAddressToDomain
    FileExtension
    IsFreeEmailDomain
    URLToDomain
    URLToProtocol
    URLToTLD


Exponential Transform Primitives
********************************
.. autosummary::
    :toctree: generated/

    ExponentialWeightedAverage
    ExponentialWeightedSTD
    ExponentialWeightedVariance


General Transform Primitives
****************************
.. autosummary::
    :toctree: generated/

    AbsoluteDiff
    Absolute
    Cosine
    IsNull
    NaturalLogarithm
    Negate
    Percentile
    PercentChange
    RateOfChange
    SameAsPrevious
    SavgolFilter
    Sine
    SquareRoot
    Tangent
    Variance

Location Transform Primitives
*****************************
.. autosummary::
    :toctree: generated/

    CityblockDistance
    GeoMidpoint
    Haversine
    IsInGeoBox
    Latitude
    Longitude

Name Transform Primitives
*************************
.. autosummary::
    :toctree: generated/

    FullNameToFirstName
    FullNameToLastName
    FullNameToTitle

NaturalLanguage Transform Primitives
************************************
.. autosummary::
    :toctree: generated/

    CountString
    MeanCharactersPerWord
    MedianWordLength
    NumCharacters
    NumUniqueSeparators
    NumWords
    NumberOfCommonWords
    NumberOfHashtags
    NumberOfMentions
    NumberOfUniqueWords
    NumberOfWordsInQuotes
    PunctuationCount
    TitleWordCount
    TotalWordLength
    UpperCaseCount
    UpperCaseWordCount
    WhitespaceCount

Postal Code Primitives
**********************
.. autosummary::
    :toctree: generated/

    OneDigitPostalCode
    TwoDigitPostalCode

Time Series Transform Primitives
********************************
.. autosummary::
    :toctree: generated/

    ExpandingCount
    ExpandingMax
    ExpandingMean
    ExpandingMin
    ExpandingSTD
    ExpandingTrend
    Lag
    RollingCount
    RollingMax
    RollingMean
    RollingMin
    RollingOutlierCount
    RollingSTD
    RollingTrend


Feature methods
---------------
.. currentmodule:: featuretools.feature_base
.. autosummary::
    :toctree: generated/

    FeatureBase.rename
    FeatureBase.get_depth


Feature calculation
~~~~~~~~~~~~~~~~~~~
.. currentmodule:: featuretools
.. autosummary::
    :toctree: generated/

    calculate_feature_matrix
    .. approximate_features

Feature descriptions
~~~~~~~~~~~~~~~~~~~~
.. currentmodule:: featuretools
.. autosummary::
    :toctree: generated/

    describe_feature

Feature visualization
~~~~~~~~~~~~~~~~~~~~~
.. currentmodule:: featuretools
.. autosummary::
    :toctree: generated/

    graph_feature

Feature encoding
~~~~~~~~~~~~~~~~
.. currentmodule:: featuretools
.. autosummary::
    :toctree: generated/

    encode_features

Feature Selection
~~~~~~~~~~~~~~~~~
.. currentmodule:: featuretools.selection
.. autosummary::
    :toctree: generated/

    remove_low_information_features
    remove_highly_correlated_features
    remove_highly_null_features
    remove_single_value_features

Feature Matrix utils
~~~~~~~~~~~~~~~~~~~~
.. currentmodule:: featuretools.computational_backends
.. autosummary::
    :toctree: generated/

    replace_inf_values


Saving and Loading Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. currentmodule:: featuretools
.. autosummary::
    :toctree: generated/

    save_features
    load_features

.. _api_ref.dataset:

EntitySet, Relationship
~~~~~~~~~~~~~~~~~~~~~~~

Constructors
------------
.. currentmodule:: featuretools
.. autosummary::
    :toctree: generated/

    EntitySet
    Relationship

EntitySet load and prepare data
-------------------------------
.. autosummary::
    :toctree: generated/

    EntitySet.add_dataframe
    EntitySet.add_interesting_values
    EntitySet.add_last_time_indexes
    EntitySet.add_relationship
    EntitySet.add_relationships
    EntitySet.concat
    EntitySet.normalize_dataframe
    EntitySet.set_secondary_time_index
    EntitySet.replace_dataframe

EntitySet serialization
-----------------------
.. currentmodule:: featuretools
.. autosummary::
    :toctree: generated/

    read_entityset

.. currentmodule:: featuretools.entityset
.. autosummary::
    :toctree: generated/

    EntitySet.to_csv
    EntitySet.to_pickle
    EntitySet.to_parquet

EntitySet query methods
-----------------------
.. autosummary::
    :toctree: generated/

    EntitySet.__getitem__
    EntitySet.find_backward_paths
    EntitySet.find_forward_paths
    EntitySet.get_forward_dataframes
    EntitySet.get_backward_dataframes
    EntitySet.query_by_values

EntitySet visualization
-----------------------
.. autosummary::
    :toctree: generated/

    EntitySet.plot

Relationship attributes
-----------------------
.. autosummary::
    :toctree: generated/

    Relationship.parent_column
    Relationship.child_column
    Relationship.parent_dataframe
    Relationship.child_dataframe

Data Type Util Methods
----------------------
.. currentmodule:: featuretools
.. autosummary::
    :toctree: generated/

    list_logical_types
    list_semantic_tags

Primitive Util Methods
----------------------
.. currentmodule:: featuretools
.. autosummary::
    :toctree: generated/

    get_recommended_primitives
    list_primitives
    summarize_primitives


================================================
FILE: docs/source/conf.py
================================================
# -*- coding: utf-8 -*-
#
# featuretools documentation build configuration file, created by
# sphinx-quickstart on Thu May 19 20:40:30 2016.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

import os
import shutil
import subprocess
import sys
from pathlib import Path

import featuretools

# run setup script
path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "setup.py")
subprocess.check_call([sys.executable, path])

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath("../featuretools"))

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.autosummary",
    "sphinx.ext.napoleon",
    "sphinx.ext.ifconfig",
    "sphinx.ext.githubpages",
    "nbsphinx",
    "IPython.sphinxext.ipython_console_highlighting",
    "IPython.sphinxext.ipython_directive",
    "sphinx.ext.extlinks",
    "sphinx.ext.viewcode",
    "sphinx.ext.graphviz",
    "sphinx_inline_tabs",
    "sphinx_copybutton",
    "myst_parser",
]


# ipython_mplbackend = None

ipython_execlines = ["import pandas as pd", "pd.set_option('display.width', 1000000)"]

# autosummary_generate=True
autosummary_generate = ["api_reference.rst"]


# Add any paths that contain templates here, relative to this directory.
templates_path = ["templates"]

# The suffix(es) of source filenames.
# You can specify multiple suffixes as a list of strings:
# source_suffix = ['.rst', '.md']

# The encoding of source files.
# source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = "index"

# General information about the project.
project = "Featuretools"
copyright = "2019, Feature Labs. BSD License"
author = "Feature Labs, Inc."

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = featuretools.__version__
# The full version, including alpha/beta/rc tags.
release = featuretools.__version__

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = "en"

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
# today = ''
# Else, today_fmt is used as the format for a strftime call.
# today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# These patterns also affect html_static_path and html_extra_path.
exclude_patterns = ["**.ipynb_checkpoints"]

# The reST default role (used for this markup: `text`) to use for all
# documents.
# default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
# add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
# add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
# show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = "sphinx"

# A list of ignored prefixes for module index sorting.
# modindex_common_prefix = []

# If true, keep warnings as "system message" paragraphs in the built documents.
# keep_warnings = False

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False


# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages.  See the documentation for
# a list of builtin themes.
html_theme = "pydata_sphinx_theme"

# Theme options are theme-specific and customize the look and feel of a theme
# further.  For a list of options available for each theme, see the
# documentation.
html_theme_options = {
    "pygment_light_style": "tango",
    "pygment_dark_style": "native",
    "icon_links": [
        {
            "name": "GitHub",
            "url": "https://github.com/alteryx/featuretools",
            "icon": "fab fa-github-square",
            "type": "fontawesome",
        },
        {
            "name": "Twitter",
            "url": "https://twitter.com/AlteryxOSS",
            "icon": "fab fa-twitter-square",
            "type": "fontawesome",
        },
        {
            "name": "Slack",
            "url": "https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA",
            "icon": "fab fa-slack",
            "type": "fontawesome",
        },
        {
            "name": "StackOverflow",
            "url": "https://stackoverflow.com/questions/tagged/featuretools",
            "icon": "fab fa-stack-overflow",
            "type": "fontawesome",
        },
    ],
    "collapse_navigation": False,
    "navigation_depth": 2,
}

# Add any paths that contain custom themes here, relative to this directory.
# html_theme_path = []

# The name for this set of Sphinx documents.
# "<project> v<release> documentation" by default.
# html_title = u'featuretools v0.1'

# A shorter title for the navigation bar.  Default is the same as html_title.
# html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
html_logo = "_static/images/featuretools_nav2.svg"

# The name of an image file (relative to this directory) to use as a favicon of
# the docs.  This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
html_favicon = "_static/images/favicon.ico"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
# html_extra_path = []

# If not None, a 'Last updated on:' timestamp is inserted at every page
# bottom, using the given strftime format.
# The empty string is equivalent to '%b %d, %Y'.
# html_last_updated_fmt = None

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
# html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
html_sidebars = {
    "**": ["globaltoc.html", "relations.html", "sourcelink.html", "searchbox.html"],
}


# Additional templates that should be rendered to pages, maps page names to
# template names.
# html_additional_pages = {}

# If false, no module index is generated.
# html_domain_indices = True

# If false, no index is generated.
# html_use_index = True

# If true, the index is split into individual pages for each letter.
# html_split_index = False

# If true, links to the reST sources are added to the pages.
# html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
html_show_sphinx = False

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
# html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it.  The value of this option must be the
# base URL from which the finished HTML is served.
# html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
# html_file_suffix = None

# Language to be used for generating the HTML full-text search index.
# Sphinx supports the following languages:
#   'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'
#   'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr', 'zh'
# html_search_language = 'en'

# A dictionary with options for the search language support, empty by default.
# 'ja' uses this config value.
# 'zh' user can custom change `jieba` dictionary path.
# html_search_options = {'type': 'default'}

# The name of a javascript file (relative to the configuration directory) that
# implements a search results scorer. If empty, the default will be used.
# html_search_scorer = 'scorer.js'

# Output file base name for HTML help builder.
htmlhelp_basename = "featuretoolsdoc"

# -- Options for Markdown files ----------------------------------------------

myst_admonition_enable = True
myst_deflist_enable = True
myst_heading_anchors = 3

# -- Options for Sphinx Copy Button ------------------------------------------

copybutton_prompt_text = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: "
copybutton_prompt_is_regexp = True

# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #'papersize': 'letterpaper',
    # The font size ('10pt', '11pt' or '12pt').
    #'pointsize': '10pt',
    # Additional stuff for the LaTeX preamble.
    #'preamble': '',
    # Latex figure (float) alignment
    #'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    (
        master_doc,
        "featuretools.tex",
        "Featuretools Documentation",
        "Feature Labs, Inc.",
        "manual",
    ),
]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
# latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
# latex_use_parts = False

# If true, show page references after internal links.
# latex_show_pagerefs = False

# If true, show URL addresses after external links.
# latex_show_urls = False

# Documents to append as an appendix to all manuals.
# latex_appendices = []

# If false, no module index is generated.
# latex_domain_indices = True


# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, "featuretools", "featuretools Documentation", [author], 1)]

# If true, show URL addresses after external links.
# man_show_urls = False


# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    (
        master_doc,
        "featuretools",
        "featuretools Documentation",
        author,
        "featuretools",
        "One line description of project.",
        "Miscellaneous",
    ),
]

# Documents to append as an appendix to all manuals.
# texinfo_appendices = []

# If false, no module index is generated.
# texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
# texinfo_show_urls = 'footnote'

# If true, do not generate a @detailmenu in the "Top" node's menu.
# texinfo_no_detailmenu = False

nbsphinx_execute = "auto"

extlinks = {
    "issue": ("https://github.com/alteryx/featuretools/issues/%s", "GH#%s"),
    "pr": ("https://github.com/alteryx/featuretools/pull/%s", "GH#%s"),
    "user": ("https://github.com/%s", "@%s"),
}

# Napoleon settings
napoleon_google_docstring = True
napoleon_numpy_docstring = True
napoleon_include_init_with_doc = False
napoleon_include_private_with_doc = False
napoleon_include_special_with_doc = True
napoleon_use_admonition_for_examples = False
napoleon_use_admonition_for_notes = False
napoleon_use_admonition_for_references = False
napoleon_use_ivar = False
napoleon_use_param = True
napoleon_use_rtype = True


def setup(app):
    home_dir = os.environ.get("HOME", "/")
    ipython_p = Path(home_dir + "/.ipython/profile_default/startup")
    ipython_p.mkdir(parents=True, exist_ok=True)
    file_p = os.path.abspath(os.path.dirname(__file__))
    shutil.copy(
        file_p + "/set-headers.py",
        home_dir + "/.ipython/profile_default/startup",
    )
    app.add_css_file("style.css")


================================================
FILE: docs/source/getting_started/afe.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deep Feature Synthesis\n",
    "\n",
    "Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.\n",
    "\n",
    "## Input Data\n",
    "\n",
    "Deep Feature Synthesis requires structured datasets in order to perform feature engineering. To demonstrate the capabilities of DFS, we will use a mock customer transactions dataset.\n"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note ::\n",
    "\n",
    "  Before using DFS, it is recommended that you prepare your data as an :class:`EntitySet`.  See :doc:`using_entitysets` to learn how."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import featuretools as ft\n",
    "\n",
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once data is prepared as an `.EntitySet`, we are ready to automatically generate features for a target dataframe - e.g. `customers`.\n",
    "\n",
    "## Running DFS\n",
    "\n",
    "Typically, without automated feature engineering, a data scientist would write code to aggregate data for a customer, and apply different statistical functions resulting in features quantifying the customer's behavior. In this example, an expert might be interested in features such as: *total number of sessions* or *month the customer signed up*.\n",
    "\n",
    "These features can be generated by DFS when we specify the target_dataframe as `customers` and `\"count\"` and `\"month\"` as primitives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"count\"],\n",
    "    trans_primitives=[\"month\"],\n",
    "    max_depth=1,\n",
    ")\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the example above, `\"count\"` is an **aggregation primitive** because it computes a single value based on many sessions related to one customer. `\"month\"` is called a **transform primitive** because it takes one value for a customer transforms it to another."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note ::\n",
    "\n",
    "  Feature primitives are a fundamental component to Featuretools. To learn more read :doc:`primitives`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating \"Deep Features\"\n",
    "\n",
    "The name Deep Feature Synthesis comes from the algorithm's ability to stack primitives to generate more complex features. Each time we stack a primitive we increase the \"depth\" of a feature. The `max_depth` parameter controls the maximum depth of the features returned by DFS. Let us try running DFS with `max_depth=2`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mean\", \"sum\", \"mode\"],\n",
    "    trans_primitives=[\"month\", \"hour\"],\n",
    "    max_depth=2,\n",
    ")\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "With a depth of 2, a number of features are generated using the supplied primitives. The algorithm to synthesize these definitions is described in this [paper](https://www.jmaxkanter.com/papers/DSAA_DSM_2015.pdf). In the returned feature matrix, let us understand one of the depth 2 features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix[[\"MEAN(sessions.SUM(transactions.amount))\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "For each customer this feature\n",
    "\n",
    "1. calculates the ``sum`` of all transaction amounts per session to get total amount per session,\n",
    "2. then applies the ``mean`` to the total amounts across multiple sessions to identify the *average amount spent per session*\n",
    "\n",
    "We call this feature a \"deep feature\" with a depth of 2.\n",
    "\n",
    "Let's look at another depth 2 feature that calculates for every customer *the most common hour of the day when they start a session*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix[[\"MODE(sessions.HOUR(session_start))\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each customer this feature calculates\n",
    "\n",
    "1. The `hour` of the day each of his or her sessions started, then\n",
    "2. uses the statistical function `mode` to identify the most common hour he or she started a session\n",
    "\n",
    "Stacking results in features that are more expressive than individual primitives themselves. This enables the automatic creation of complex patterns for machine learning."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note ::\n",
    "    You can graphically visualize the lineage of a feature by calling :func:`featuretools.graph_feature` on it. You can also generate an English description of the feature with :func:`featuretools.describe_feature`. See :doc:`/guides/feature_descriptions` for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Changing Target DataFrame\n",
    "\n",
    "DFS is powerful because we can create a feature matrix for any dataframe in our dataset. If we switch our target dataframe to \"sessions\", we can synthesize features for each session instead of each customer. Now, we can use these features to predict the outcome of a session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"sessions\",\n",
    "    agg_primitives=[\"mean\", \"sum\", \"mode\"],\n",
    "    trans_primitives=[\"month\", \"hour\"],\n",
    "    max_depth=2,\n",
    ")\n",
    "feature_matrix.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "As we can see, DFS will also build deep features based on a parent dataframe, in this case the customer of a particular session. For example, the feature below calculates the mean transaction amount of the customer of the session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix[[\"customers.MEAN(transactions.amount)\"]].head(5)"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Improve feature output\n",
    "~~~~~~~~~~~~~~~~~~~~~~\n",
    "\n",
    "To learn about the parameters to change in DFS read :doc:`/guides/tuning_dfs`.\n",
    "\n",
    "\n",
    ".. here it maybe nice to have a table that shows the number of features generated for AirBnB and other KAGGLE datasets once we have them. We can also give the user access to it."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}


================================================
FILE: docs/source/getting_started/getting_started_index.rst
================================================
Getting Started
---------------

For a quick introduction to Featuretools, check out our :ref:`5 minute quick start guide <quick-start>`.

The following guides introduce the main concepts you need to start working with Featuretools:

.. toctree::
   :maxdepth: 1

   using_entitysets
   afe
   primitives
   woodwork_types
   handling_time


================================================
FILE: docs/source/getting_started/handling_time.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a8104f18",
   "metadata": {},
   "source": [
    "# Handling Time\n",
    "\n",
    "\n",
    "When performing feature engineering with temporal data, carefully selecting the data that is used for any calculation is paramount. By annotating dataframes with a Woodwork **time index** column and providing a **cutoff time** during feature calculation, Featuretools will automatically filter out any data after the cutoff time before running any calculations."
   ]
  },
  {
   "cell_type": "raw",
   "id": "9cd9cb82",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note::\n",
    "        This guide focuses on performing feature engineering on temporal data, but it is not specific to feature engineering for time series problems, which are their own class of machine learning problems. A guide on **using Featuretools for time series feature engineering** can be found `here <../guides/time_series.ipynb>`_."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32c2ae4d",
   "metadata": {},
   "source": [
    "## What is the Time Index?\n",
    "\n",
    "\n",
    "The time index is the column in the data that specifies when the data in each row became known. For example, let's examine a table of customer transactions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebbcb40b",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.options.display.max_columns = 200"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8202f11a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import featuretools as ft\n",
    "\n",
    "es = ft.demo.load_mock_customer(return_entityset=True, random_seed=0)\n",
    "es[\"transactions\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd26087b",
   "metadata": {},
   "source": [
    "In this table, there is one row for every transaction and a ``transaction_time`` column that specifies when the transaction took place. This means that ``transaction_time`` is the time index because it indicates when the information in each row became known and available for feature calculations. For now, ignore the ``_ft_last_time`` column. That is a featuretools-generated column that will be discussed later on.\n",
    "\n",
    "However, not every datetime column is a time index. Consider the ``customers`` dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87dd0a0d",
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"customers\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c89d548d",
   "metadata": {},
   "source": [
    "Here, we have two time columns, ``join_date`` and ``birthday``. While either column might be useful for making features, the ``join_date`` should be used as the time index because it indicates when that customer first became available in the dataset."
   ]
  },
  {
   "cell_type": "raw",
   "id": "85b51512",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. important::\n",
    "\n",
    "    The **time index** is defined as the first time that any information from a row can be used. If a cutoff time is specified when calculating features, rows that have a later value for the time index are automatically ignored."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00e3c365",
   "metadata": {},
   "source": [
    "# What is the Cutoff Time?\n",
    "The **cutoff_time** specifies the last point in time that a row’s data can be used for a feature calculation. Any data after this point in time will be filtered out before calculating features.\n",
    "\n",
    "For example, let's consider a dataset of timestamped customer transactions, where we want to predict whether customers ``1``, ``2`` and ``3`` will spend $500 between ``04:00`` on January 1 and the end of the day. When building features for this prediction problem, we need to ensure that no data after ``04:00`` is used in our calculations.\n",
    "\n",
    "<img src=\"../_static/images/retail_ct.png\" width=\"400\" align=\"center\" alt=\"retail cutoff time diagram\">"
   ]
  },
  {
   "cell_type": "raw",
   "id": "19855e77",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "We pass the cutoff time to :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix` using the ``cutoff_time`` argument like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0717f7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "fm, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    cutoff_time=pd.Timestamp(\"2014-1-1 04:00\"),\n",
    "    instance_ids=[1, 2, 3],\n",
    "    cutoff_time_in_index=True,\n",
    ")\n",
    "fm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "feafa08d",
   "metadata": {},
   "source": [
    "Even though the entityset contains the complete transaction history for each customer, only data with a time index up to and including the cutoff time was used to calculate the features above.\n",
    "\n",
    "## Using a Cutoff Time DataFrame\n",
    "\n",
    "\n",
    "Oftentimes, the training examples for machine learning will come from different points in time. To specify a unique cutoff time for each row of the resulting feature matrix, we can pass a dataframe which includes one column for the instance id and another column for the corresponding cutoff time. These columns can be in any order, but they must be named properly. The column with the instance ids must either be named ``instance_id`` or have the same name as the target dataframe ``index``. The column with the cutoff time values must either be named ``time`` or have the same name as the target dataframe ``time_index``.\n",
    "\n",
    "The column names for the instance ids and the cutoff time values should be unambiguous. Passing a dataframe that contains both a column with the same name as the target dataframe ``index`` and a column named ``instance_id`` will result in an error. Similarly, if the cutoff time dataframe contains both a column with the same name as the target dataframe ``time_index`` and a column named ``time`` an error will be raised."
   ]
  },
  {
   "cell_type": "raw",
   "id": "6ffaffd0",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note::\n",
    "\n",
    "    Only the columns corresponding to the instance ids and the cutoff times are used to calculate features. Any additional columns passed through are appended to the resulting feature matrix. This is typically used to pass through machine learning labels to ensure that they stay aligned with the feature matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa5cc115",
   "metadata": {},
   "outputs": [],
   "source": [
    "cutoff_times = pd.DataFrame()\n",
    "cutoff_times[\"customer_id\"] = [1, 2, 3, 1]\n",
    "cutoff_times[\"time\"] = pd.to_datetime(\n",
    "    [\"2014-1-1 04:00\", \"2014-1-1 05:00\", \"2014-1-1 06:00\", \"2014-1-1 08:00\"]\n",
    ")\n",
    "cutoff_times[\"label\"] = [True, True, False, True]\n",
    "cutoff_times\n",
    "fm, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    cutoff_time=cutoff_times,\n",
    "    cutoff_time_in_index=True,\n",
    ")\n",
    "fm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6185bb0d",
   "metadata": {},
   "source": [
    "We can now see that every row of the feature matrix is calculated at the corresponding time in the cutoff time dataframe. Because we calculate each row at a different time, it is possible to have a repeat customer. In this case, we calculated the feature vector for customer 1 at both ``04:00`` and ``08:00``.\n",
    "\n",
    "Training Window\n",
    "---------------\n",
    "\n",
    "By default, all data up to and including the cutoff time is used. We can restrict the amount of historical data that is selected for calculations using a \"training window.\"\n",
    "\n",
    "Here's an example of using a two hour training window:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e321d463",
   "metadata": {},
   "outputs": [],
   "source": [
    "window_fm, window_features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    cutoff_time=cutoff_times,\n",
    "    cutoff_time_in_index=True,\n",
    "    training_window=\"2 hour\",\n",
    ")\n",
    "\n",
    "window_fm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ee67c4d",
   "metadata": {},
   "source": [
    "We can see that that the counts for the same feature are lower after we shorten the training window:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93d6b9ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "fm[[\"COUNT(transactions)\"]]\n",
    "\n",
    "window_fm[[\"COUNT(transactions)\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad7c73c4",
   "metadata": {},
   "source": [
    "## Setting a Last Time Index\n",
    "\n",
    "The training window in Featuretools limits the amount of past data that can be used while calculating a particular feature vector. A row in the dataframe is filtered out if the value of its time index is either before or after the training window. This works for dataframes where a row occurs at a single point in time. However, a row can sometimes exist for a duration.\n",
    "\n",
    "For example, a customer's session has multiple transactions which can happen at different points in time. If we are trying to count the number of sessions a user has in a given time period, we often want to count all the sessions that had *any* transaction during the training window. To accomplish this, we need to not only know when a session starts, but also when it ends. The last time that an instance appears in the data is stored in the `_ft_last_time` column on the dataframe. We can compare the time index and the last time index of the ``sessions`` dataframe above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "493c8193",
   "metadata": {},
   "outputs": [],
   "source": [
    "last_time_index_col = es[\"sessions\"].ww.metadata.get(\"last_time_index\")\n",
    "es[\"sessions\"][[\"session_start\", last_time_index_col]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7f1c5cb",
   "metadata": {},
   "source": [
    "Featuretools can automatically add last time indexes to every DataFrame in an ``Entityset`` by running ``EntitySet.add_last_time_indexes()``. When using a training window, if a `last_time_index has` been set, Featuretools will check to see if the `last_time_index` is after the start of the training window. That, combined with the cutoff time, allows DFS to discover which data is relevant for a given training window.\n",
    "\n",
    "\n",
    "## Excluding data at cutoff times"
   ]
  },
  {
   "cell_type": "raw",
   "id": "b44bee57",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The ``cutoff_time`` is the last point in time where data can be used for feature\n",
    "calculation. If you don't want to use the data at the cutoff time in feature\n",
    "calculation, you can exclude that data by setting ``include_cutoff_time`` to\n",
    "``False`` in :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix`.\n",
    "If you set it to ``True`` (the default behavior), data from the cutoff time point\n",
    "will be used."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e92d895",
   "metadata": {},
   "source": [
    "Setting ``include_cutoff_time`` to ``False`` also impacts how data at the edges\n",
    "of training windows are included or excluded.  Take this slice of data as an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76f9676f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = es[\"transactions\"]\n",
    "df[df[\"session_id\"] == 1].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce77f6fd",
   "metadata": {},
   "source": [
    "Looking at the data, transactions occur every 65 seconds.  To check how ``include_cutoff_time``\n",
    "effects training windows, we can calculate features at the time of a transaction\n",
    "while using a 65 second training window.  This creates a training window with a\n",
    "transaction at both endpoints of the window.  For this example, we'll find the sum\n",
    "of all transactions for session id 1 that are in the training window."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1841d78b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from featuretools.primitives import Sum\n",
    "\n",
    "sum_log = ft.Feature(\n",
    "    es[\"transactions\"].ww[\"amount\"],\n",
    "    parent_dataframe_name=\"sessions\",\n",
    "    primitive=Sum,\n",
    ")\n",
    "cutoff_time = pd.DataFrame(\n",
    "    {\n",
    "        \"session_id\": [1],\n",
    "        \"time\": [\"2014-01-01 00:04:20\"],\n",
    "    }\n",
    ").astype({\"time\": \"datetime64[ns]\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c15be10",
   "metadata": {},
   "source": [
    "With ``include_cutoff_time=True``, the oldest point in the training window\n",
    "(``2014-01-01 00:03:15``) is excluded and the cutoff time point is included. This\n",
    "means only transaction 371 is in the training window, so the sum of all transaction\n",
    "amounts is 31.54"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f782683a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Case1. include_cutoff_time = True\n",
    "actual = ft.calculate_feature_matrix(\n",
    "    features=[sum_log],\n",
    "    entityset=es,\n",
    "    cutoff_time=cutoff_time,\n",
    "    cutoff_time_in_index=True,\n",
    "    training_window=\"65 seconds\",\n",
    "    include_cutoff_time=True,\n",
    ")\n",
    "actual"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "324246db",
   "metadata": {},
   "source": [
    "Whereas with ``include_cutoff_time=False``, the oldest point in the window is\n",
    "included and the cutoff time point is excluded.  So in this case transaction 116\n",
    "is included and transaction 371 is exluded, and the sum is 78.92\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b63bc68",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Case2. include_cutoff_time = False\n",
    "actual = ft.calculate_feature_matrix(\n",
    "    features=[sum_log],\n",
    "    entityset=es,\n",
    "    cutoff_time=cutoff_time,\n",
    "    cutoff_time_in_index=True,\n",
    "    training_window=\"65 seconds\",\n",
    "    include_cutoff_time=False,\n",
    ")\n",
    "actual"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4329314f",
   "metadata": {},
   "source": [
    "Approximating Features by Rounding Cutoff Times\n",
    "-----------------------------------------------\n",
    "\n",
    "For each unique cutoff time, Featuretools must perform operations to select the data that’s valid for computations. If there are a large number of unique cutoff times relative to the number of instances for which we are calculating features, the time spent filtering data can add up. By reducing the number of unique cutoff times, we minimize the overhead from searching for and extracting data for feature calculations.\n",
    "\n",
    "One way to decrease the number of unique cutoff times is to round cutoff times to an earlier point in time. An earlier cutoff time is always valid for predictive modeling — it just means we’re not using some of the data we could potentially use while calculating that feature. So, we gain computational speed by losing a small amount of information.\n",
    "\n",
    "To understand when an approximation is useful, consider calculating features for a model to predict fraudulent credit card transactions. In this case, an important feature might be, \"the average transaction amount for this card in the past\". While this value can change every time there is a new transaction, updating it less frequently might not impact accuracy."
   ]
  },
  {
   "cell_type": "raw",
   "id": "3628cc1c",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note::\n",
    "\n",
    "    The bank BBVA used approximation when building a predictive model for credit card fraud using Featuretools. For more details, see the \"Real-time deployment considerations\" section of the `white paper <https://arxiv.org/abs/1710.07709>`_ describing the work involved.\n"
   ]
  },
  {
   "cell_type": "raw",
   "id": "4bf10090",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The frequency of approximation is controlled using the ``approximate`` parameter to :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix`. For example, the following code would approximate aggregation features at 1 day intervals::"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "641981d0",
   "metadata": {},
   "source": [
    "    fm = ft.calculate_feature_matrix(features=features,\n",
    "                                     entityset=es_transactions,\n",
    "                                     cutoff_time=ct_transactions,\n",
    "                                     approximate=\"1 day\")\n",
    "\n",
    "In this computation, features that can be approximated will be calculated at 1 day intervals, while features that cannot be approximated (e.g \"where did this transaction occur?\") will be calculated at the exact cutoff time.\n",
    "\n",
    "\n",
    "## Secondary Time Index\n",
    "\n",
    "It is sometimes the case that information in a dataset is updated or added after a row has been created. This means that certain columns may actually become known after the time index for a row. Rather than drop those columns to avoid leaking information, we can create a secondary time index to indicate when those columns become known."
   ]
  },
  {
   "cell_type": "raw",
   "id": "6f8197f9",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The :func:`Flights <featuretools.demo.load_flight>` entityset is a good example of a dataset where column values in a row become known at different times. Each trip is recorded in the ``trip_logs`` dataframe, and has many times associated with it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6043477",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "import urllib.request as urllib2\n",
    "\n",
    "opener = urllib2.build_opener()\n",
    "opener.addheaders = [(\"Testing\", \"True\")]\n",
    "urllib2.install_opener(opener)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abf92463",
   "metadata": {},
   "outputs": [],
   "source": [
    "es_flight = ft.demo.load_flight(nrows=100)\n",
    "es_flight\n",
    "es_flight[\"trip_logs\"].head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36827ff9",
   "metadata": {},
   "source": [
    "For every trip log, the time index is ``date_scheduled``, which is when the airline decided on the scheduled departure and arrival times, as well as what route will be flown. We don't know the rest of the information about the actual departure/arrival times and the details of any delay at this time. However, it is possible to know everything about how a trip went after it has arrived, so we can use that information at any time after the flight lands.\n",
    "\n",
    "Using a secondary time index, we can indicate to Featuretools which columns in our flight logs are known at the time the flight is scheduled, plus which are known at the time the flight lands.\n",
    "\n",
    "<img src=\"../_static/images/flight_ti_2.png\" width=\"400\" align=\"center\" alt=\"flight secondary time index diagram\">\n",
    "\n",
    "In Featuretools, when adding the dataframe to the ``EntitySet``, we set the secondary time index to be the arrival time like this:\n",
    "\n",
    "    es = ft.EntitySet('Flight Data')\n",
    "    arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',\n",
    "                        'national_airspace_delay', 'security_delay',\n",
    "                        'late_aircraft_delay', 'canceled', 'diverted',\n",
    "                        'taxi_in', 'taxi_out', 'air_time', 'dep_time']\n",
    "\n",
    "    es.add_dataframe(\n",
    "        dataframe_name='trip_logs',\n",
    "        dataframe=data,\n",
    "        index='trip_log_id',\n",
    "        make_index=True,\n",
    "        time_index='date_scheduled',\n",
    "        secondary_time_index={'arr_time': arr_time_columns})\n",
    "\n",
    "By setting a secondary time index, we can still use the delay information from a row, but only when it becomes known."
   ]
  },
  {
   "cell_type": "raw",
   "id": "eaef7ec8",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. hint::\n",
    "\n",
    "    It's often a good idea to use a secondary time index if your entityset has inline labels. If you know when the label would be valid for use, it's possible to automatically create very predictive features using historical labels."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03448def",
   "metadata": {},
   "source": [
    "## Flight Predictions\n",
    "\n",
    "Let's make some features at varying times using the flight example described above. Trip ``14`` is a flight from CLT to PHX on January 31, 2017 and trip ``92`` is a flight from PIT to DFW on January 1. We can set any cutoff time before the flight is scheduled to depart, emulating how we would make the prediction at that point in time.\n",
    "\n",
    "We set two cutoff times for trip ``14``: one more than a month before the flight and another only 5 days before. For trip ``92``, we'll only set one cutoff time, three days before it is scheduled to leave.\n",
    "\n",
    "<img src=\"../_static/images/flight_ct.png\" width=\"500\" align=\"center\" alt=\"flight cutoff time diagram\">\n",
    "\n",
    "Our cutoff time dataframe looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c338105b",
   "metadata": {},
   "outputs": [],
   "source": [
    "ct_flight = pd.DataFrame()\n",
    "ct_flight[\"trip_log_id\"] = [14, 14, 92]\n",
    "ct_flight[\"time\"] = pd.to_datetime([\"2016-12-28\", \"2017-1-25\", \"2016-12-28\"])\n",
    "ct_flight[\"label\"] = [True, True, False]\n",
    "ct_flight"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f26db5dd",
   "metadata": {},
   "source": [
    "Now, let's calculate the feature matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd56c24e",
   "metadata": {},
   "outputs": [],
   "source": [
    "fm, features = ft.dfs(\n",
    "    entityset=es_flight,\n",
    "    target_dataframe_name=\"trip_logs\",\n",
    "    cutoff_time=ct_flight,\n",
    "    cutoff_time_in_index=True,\n",
    "    agg_primitives=[\"max\"],\n",
    "    trans_primitives=[\"month\"],\n",
    ")\n",
    "fm[\n",
    "    [\n",
    "        \"flights.origin\",\n",
    "        \"flights.dest\",\n",
    "        \"label\",\n",
    "        \"flights.MAX(trip_logs.arr_delay)\",\n",
    "        \"MONTH(scheduled_dep_time)\",\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f367279c",
   "metadata": {},
   "source": [
    "Let's understand the output:\n",
    "\n",
    "1. A row was made for every id-time pair in ``ct_flight``, which is returned as the index of the feature matrix.\n",
    "\n",
    "2. The output was sorted by cutoff time. Because of the sorting, it's often helpful to pass in a label with the cutoff time dataframe so that it will remain sorted in the same fashion as the feature matrix. Any additional columns beyond ``id`` and ``cutoff_time`` will not be used for making features.\n",
    "\n",
    "3. The column ``flights.MAX(trip_logs.arr_delay)`` is not always defined. It can only have any real values when there are historical flights to aggregate. Notice that, for trip ``14``, there wasn't any historical data when we made the feature a month in advance, but there **were** flights to aggregate when we shortened it to 5 days. These are powerful features that are often excluded in manual processes because of how hard they are to make.\n",
    "\n",
    "\n",
    "## Creating and Flattening a Feature Tensor"
   ]
  },
  {
   "cell_type": "raw",
   "id": "3d5f23cc",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The :func:`featuretools.make_temporal_cutoffs` function generates a series of equally spaced cutoff times from a given set of cutoff times and instance ids."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7b677e7",
   "metadata": {},
   "source": [
    "This function can be paired with DFS to create and flatten a feature tensor rather than making multiple feature matrices at different delays.\n",
    "\n",
    "The function takes the following parameters:\n",
    "\n",
    " * ``instance_ids (list, pd.Series, or np.ndarray)``: A list of instances.\n",
    " * ``cutoffs (list, pd.Series, or np.ndarray)``: An associated list of cutoff times.\n",
    " * ``window_size (str or pandas.DateOffset)``: The amount of time between each cutoff time in the created time series.\n",
    " * ``start (datetime.datetime or pd.Timestamp)``: The first cutoff time in the created time series.\n",
    " * ``num_windows (int)``: The number of cutoff times to create in the created time series.\n",
    "\n",
    "Only two of the three options ``window_size``, ``start``, and ``num_windows`` need to be specified to uniquely determine an equally-spaced set of cutoff times at which to compute each instance.\n",
    "\n",
    "If your cutoff times are the ones used above:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7648a9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "cutoff_times"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bda6ff4",
   "metadata": {},
   "source": [
    "Then passing in ``window_size='1h'`` and ``num_windows=2`` makes one row an hour over the last two hours to produce the following new dataframe. The result can be directly passed into DFS to make features at the different time points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4204f47",
   "metadata": {},
   "outputs": [],
   "source": [
    "temporal_cutoffs = ft.make_temporal_cutoffs(\n",
    "    cutoff_times[\"customer_id\"], cutoff_times[\"time\"], window_size=\"1h\", num_windows=2\n",
    ")\n",
    "temporal_cutoffs\n",
    "fm, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    cutoff_time=temporal_cutoffs,\n",
    "    cutoff_time_in_index=True,\n",
    ")\n",
    "fm"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/getting_started/primitives.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _primitives:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature primitives\n",
    "Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because primitives only constrain the input and output data types, they can be applied across datasets and stacked to create new calculations.\n",
    "\n",
    "## Why primitives?\n",
    "The space of potential functions that humans use to create a feature is expansive. By breaking common feature engineering calculations down into primitive components, we are able to capture the underlying structure of the features humans create today.\n",
    "\n",
    "A primitive only constrains the input and output data types, so it can be used to transfer calculations known in one domain to another. Consider a feature that data scientists often calculate for transactional or event log data: *average time between events*. This feature is incredibly valuable in predicting fraudulent behavior or future customer engagement.\n",
    "\n",
    "DFS achieves the same feature by stacking two primitives: `\"time_since_previous\"` and `\"mean\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import featuretools as ft\n",
    "\n",
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "\n",
    "feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mean\"],\n",
    "    trans_primitives=[\"time_since_previous\"],\n",
    "    features_only=True,\n",
    ")\n",
    "\n",
    "feature_defs"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note:: \n",
    "\n",
    "    The primitive arguments to DFS (e.g. ``agg_primitives`` and ``trans_primitives`` in the example above) accept ``snake_case``, ``camelCase``, or ``TitleCase`` strings of included Featuretools primitives (i.e. ``time_since_previous``, ``timeSincePrevious``, and ``TimeSincePrevious`` are all acceptable inputs).\n",
    "\n",
    ".. note::\n",
    "\n",
    "    When ``dfs`` is called with ``features_only=True``, only feature definitions are returned as output. By default this parameter is set to ``False``. This parameter is useful for quickly inspecting the feature definitions before spending time calculating the feature matrix."
   ]
  },
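  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustrative sketch of the first note, the same primitives can be requested with different spellings. The variable names `snake_case_defs` and `title_case_defs` are our own and not part of Featuretools. Assuming the spellings are interchangeable as described, both calls should produce features with the same names, which the last line checks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Request the primitives with snake_case names ...\n",
    "snake_case_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mean\"],\n",
    "    trans_primitives=[\"time_since_previous\"],\n",
    "    features_only=True,\n",
    ")\n",
    "\n",
    "# ... and again with TitleCase names.\n",
    "title_case_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"Mean\"],\n",
    "    trans_primitives=[\"TimeSincePrevious\"],\n",
    "    features_only=True,\n",
    ")\n",
    "\n",
    "# Compare the generated feature names from both calls.\n",
    "[f.get_name() for f in snake_case_defs] == [f.get_name() for f in title_case_defs]"
   ]
  },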
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. This is used by Deep Feature Synthesis to get several different ways of summarizing the time since the previous event."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mean\", \"max\", \"min\", \"std\", \"skew\"],\n",
    "    trans_primitives=[\"time_since_previous\"],\n",
    ")\n",
    "\n",
    "feature_matrix[\n",
    "    [\n",
    "        \"MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n",
    "        \"MAX(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n",
    "        \"MIN(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n",
    "        \"STD(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n",
    "        \"SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))\",\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aggregation vs Transform Primitive\n",
    "\n",
    "In the example above, we use two types of primitives.\n",
    "\n",
    "**Aggregation primitives:** These primitives take related instances as an input and output a single value. They are applied across a parent-child relationship in an EntitySet. E.g: `\"count\"`, `\"sum\"`, `\"avg_time_between\"`."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. graphviz:: graphs/agg_feat.dot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Transform primitives:** These primitives take one or more columns from a dataframe as an input and output a new column for that dataframe. They are applied to a single dataframe. E.g: `\"hour\"`, `\"time_since_previous\"`, `\"absolute\"`."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. graphviz:: graphs/trans_feat.dot\n",
    "\n",
    "\n",
    "The above graphs were generated using the :func:`graph_feature <featuretools.graph_feature>` function. These feature lineage graphs help to visually show how primitives were stacked to generate a feature."
   ]
  },
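  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A lineage graph for one of the stacked features computed earlier could be drawn along the following lines. This is a minimal sketch: it assumes the optional `graphviz` dependency is installed, and the names `target_name` and `stacked_feature` are only illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Look up one stacked feature by name among the definitions returned by DFS.\n",
    "target_name = \"MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))\"\n",
    "stacked_feature = next(f for f in feature_defs if f.get_name() == target_name)\n",
    "\n",
    "# Draw the lineage graph showing how the primitives were stacked.\n",
    "ft.graph_feature(stacked_feature)"
   ]
  },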
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a DataFrame that lists and describes each built-in primitive in Featuretools, call `ft.list_primitives()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.list_primitives().head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a DataFrame of metrics that summarizes various properties and capabilities of all of the built-in primitives in Featuretools, call `ft.summarize_primitives()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.summarize_primitives()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining Custom Primitives\n",
    "\n",
    "The library of primitives in Featuretools is constantly expanding. Users can define their own primitives using the APIs below. To define a primitive, a user will:\n",
    "\n",
    "\n",
    "  * Specify the type of primitive: `Aggregation` or `Transform`\n",
    "  * Define the input and output data types\n",
    "  * Write a function in Python to do the calculation\n",
    "  * Annotate with attributes to constrain how it is applied\n",
    "\n",
    "\n",
    "Once a primitive is defined, it can stack with existing primitives to generate complex patterns. This enables primitives known to be important for one domain to automatically be transferred to another."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from woodwork.column_schema import ColumnSchema\n",
    "from woodwork.logical_types import Datetime, NaturalLanguage\n",
    "\n",
    "from featuretools.primitives import AggregationPrimitive, TransformPrimitive\n",
    "from featuretools.tests.testing_utils import make_ecommerce_entityset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "### Simple Custom Primitives"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Absolute(TransformPrimitive):\n",
    "    name = \"absolute\"\n",
    "    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n",
    "    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n",
    "\n",
    "    def get_function(self):\n",
    "        def absolute(column):\n",
    "            return abs(column)\n",
    "\n",
    "        return absolute"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "Above, we created a new transform primitive that can be used with Deep Feature Synthesis by deriving a new primitive class using `TransformPrimitive` as a base and overriding `get_function` to return a function that calculates the feature. Additionally, we set the input data types that the primitive applies to and the return data type. Input and return data types are defined using a Woodwork ColumnSchema. A full guide on Woodwork logical types and semantic tags can be found in the Woodwork [Understanding Logical Types and Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide.\n",
    "\n",
    "Similarly, we can make a new aggregation primitive using `AggregationPrimitive`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Maximum(AggregationPrimitive):\n",
    "    name = \"maximum\"\n",
    "    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n",
    "    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n",
    "\n",
    "    def get_function(self):\n",
    "        def maximum(column):\n",
    "            return max(column)\n",
    "\n",
    "        return maximum"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "Because we defined an aggregation primitive, the function takes in a list of values but only returns one.\n",
    "\n",
    "Now that we've defined two primitives, we can use them with the dfs function as if they were built-in primitives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"sessions\",\n",
    "    agg_primitives=[Maximum],\n",
    "    trans_primitives=[Absolute],\n",
    "    max_depth=2,\n",
    ")\n",
    "\n",
    "feature_matrix.head(5)[\n",
    "    [\n",
    "        \"customers.MAXIMUM(transactions.amount)\",\n",
    "        \"MAXIMUM(transactions.ABSOLUTE(amount))\",\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "### Word Count Example\n",
    "\n",
    "Here we define a transform primitive, `WordCount`, which counts the number of words in each row of an input and returns a list of the counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class WordCount(TransformPrimitive):\n",
    "    \"\"\"\n",
    "    Counts the number of words in each row of the column. Returns a list\n",
    "    of the counts for each row.\n",
    "    \"\"\"\n",
    "\n",
    "    name = \"word_count\"\n",
    "    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n",
    "    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n",
    "\n",
    "    def get_function(self):\n",
    "        def word_count(column):\n",
    "            word_counts = []\n",
    "            for value in column:\n",
    "                words = value.split(None)\n",
    "                word_counts.append(len(words))\n",
    "            return word_counts\n",
    "\n",
    "        return word_count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = make_ecommerce_entityset()\n",
    "\n",
    "feature_matrix, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"sessions\",\n",
    "    agg_primitives=[\"sum\", \"mean\", \"std\"],\n",
    "    trans_primitives=[WordCount],\n",
    ")\n",
    "\n",
    "feature_matrix[\n",
    "    [\n",
    "        \"customers.WORD_COUNT(favorite_quote)\",\n",
    "        \"STD(log.WORD_COUNT(comments))\",\n",
    "        \"SUM(log.WORD_COUNT(comments))\",\n",
    "        \"MEAN(log.WORD_COUNT(comments))\",\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/markdown"
   },
   "source": [
    "By adding some aggregation primitives as well, Deep Feature Synthesis was able to make four new features from one new primitive.\n",
    "\n",
    "### Multiple Input Types\n",
    "\n",
    "If a primitive requires multiple features as input, `input_types` has multiple elements. For example, `[ColumnSchema(semantic_tags={'numeric'}), ColumnSchema(semantic_tags={'numeric'})]` would mean the primitive requires two columns with the semantic tag `numeric` as input. Below is an example of a primitive that has multiple input features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MeanSunday(AggregationPrimitive):\n",
    "    \"\"\"\n",
    "    Finds the mean of non-null values of a feature that occurred on Sundays\n",
    "    \"\"\"\n",
    "\n",
    "    name = \"mean_sunday\"\n",
    "    input_types = [\n",
    "        ColumnSchema(semantic_tags={\"numeric\"}),\n",
    "        ColumnSchema(logical_type=Datetime),\n",
    "    ]\n",
    "    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n",
    "\n",
    "    def get_function(self):\n",
    "        def mean_sunday(numeric, datetime):\n",
    "            days = pd.DatetimeIndex(datetime).weekday.values\n",
    "            df = pd.DataFrame({\"numeric\": numeric, \"time\": days})\n",
    "            return df[df[\"time\"] == 6][\"numeric\"].mean()\n",
    "\n",
    "        return mean_sunday"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"sessions\",\n",
    "    agg_primitives=[MeanSunday],\n",
    "    trans_primitives=[],\n",
    "    max_depth=1,\n",
    ")\n",
    "\n",
    "feature_matrix[\n",
    "    [\n",
    "        \"MEAN_SUNDAY(log.value, datetime)\",\n",
    "        \"MEAN_SUNDAY(log.value_2, datetime)\",\n",
    "    ]\n",
    "]"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}


================================================
FILE: docs/source/getting_started/using_entitysets.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Representing Data with EntitySets\n",
    "\n",
    "An ``EntitySet`` is a collection of dataframes and the relationships between them. EntitySets are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools take ``dataframes`` and ``relationships`` as separate arguments, it is recommended to create an ``EntitySet``, so you can more easily manipulate your data as needed.\n",
    "\n",
    "## The Raw Data\n",
    "\n",
    "Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers so that the result looks like something you might see in a log file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import featuretools as ft\n",
    "\n",
    "data = ft.demo.load_mock_customer()\n",
    "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n",
    "\n",
    "transactions_df.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the second dataframe is a list of products involved in those transactions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "products_df = data[\"products\"]\n",
    "products_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating an EntitySet\n",
    "\n",
    "First, we initialize an ``EntitySet``. If you'd like to give it a name, you can optionally provide an ``id`` to the constructor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.EntitySet(id=\"customer_data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding dataframes\n",
    "\n",
    "To get started, we add the transactions dataframe to the `EntitySet`. In the call to ``add_dataframe``, we specify three important parameters:\n",
    "\n",
    "* The ``index`` parameter specifies the column that uniquely identifies rows in the dataframe.\n",
    "* The ``time_index`` parameter tells Featuretools when the data was created.\n",
    "* The ``logical_types`` parameter indicates that \"product_id\" should be interpreted as a Categorical column, even though it is just an integer in the underlying data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from woodwork.logical_types import Categorical, PostalCode\n",
    "\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"transactions\",\n",
    "    dataframe=transactions_df,\n",
    "    index=\"transaction_id\",\n",
    "    time_index=\"transaction_time\",\n",
    "    logical_types={\n",
    "        \"product_id\": Categorical,\n",
    "        \"zip_code\": PostalCode,\n",
    "    },\n",
    ")\n",
    "\n",
    "es"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. currentmodule:: featuretools\n",
    "\n",
    "\n",
    ".. note ::\n",
    "\n",
    "    You can also use a setter on the ``EntitySet`` object to add dataframes\n",
    "\n",
    "    ``es[\"transactions\"] = transactions_df``\n",
    "\n",
    "    Note that this will use the default implementation of `add_dataframe`, notably the following:\n",
    "\n",
    "    * if the DataFrame does not have `Woodwork <https://woodwork.alteryx.com/>`_ initialized, the first column will be the index column\n",
    "    * if the DataFrame does not have Woodwork initialized, all columns will be inferred by Woodwork.\n",
    "    * if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe.\n",
    "\n",
    ".. note ::\n",
    "\n",
    "    You can also display your `EntitySet` structure graphically by calling :meth:`.EntitySet.plot`."
   ]
  },
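  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the last point of the note, the sketch below initializes Woodwork on a copy of the raw transactions data before using the setter, so the index, time index, and a logical type are chosen explicitly rather than inferred. The names ``transactions_ww`` and ``es_setter_demo`` are hypothetical, and a separate ``EntitySet`` is used so that ``es`` is left unchanged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy the raw data so the dataframe already stored in `es` is left untouched.\n",
    "transactions_ww = transactions_df.copy()\n",
    "\n",
    "# Initialize Woodwork first to control the index, time index and logical types.\n",
    "transactions_ww.ww.init(\n",
    "    name=\"transactions\",\n",
    "    index=\"transaction_id\",\n",
    "    time_index=\"transaction_time\",\n",
    "    logical_types={\"product_id\": Categorical},\n",
    ")\n",
    "\n",
    "# Add the prepared dataframe to a separate, illustrative EntitySet via the setter.\n",
    "es_setter_demo = ft.EntitySet(id=\"setter_demo\")\n",
    "es_setter_demo[\"transactions\"] = transactions_ww\n",
    "es_setter_demo"
   ]
  },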
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method associates each column in the dataframe with a [Woodwork](https://woodwork.alteryx.com/) logical type. Each logical type can have an associated standard semantic tag that helps define the column data type. If you don't specify the logical type for a column, it gets inferred based on the underlying data. The logical types and semantic tags are listed in the schema of the dataframe. For more information on working with logical types and semantic tags, take a look at the [Woodwork documentation](https://woodwork.alteryx.com/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"transactions\"].ww.schema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can do the same thing with our products dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"products\", dataframe=products_df, index=\"product_id\"\n",
    ")\n",
    "\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With two dataframes in our `EntitySet`, we can add a relationship between them.\n",
    "\n",
    "## Adding a Relationship\n",
    "\n",
    "We want to relate these two dataframes by the columns called \"product_id\" in each dataframe. Each product has multiple transactions associated with it, so it is called the **parent dataframe**, while the transactions dataframe is known as the **child dataframe**. When specifying relationships, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. Note that each relationship must denote a one-to-many relationship rather than a relationship which is one-to-one or many-to-many."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = es.add_relationship(\"products\", \"product_id\", \"transactions\", \"product_id\")\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we see the relationship has been added to our `EntitySet`.\n",
    "\n",
    "## Creating a dataframe from an existing table\n",
    "\n",
    "When working with raw data, it is common to have sufficient information to justify the creation of new dataframes. In order to create a new dataframe and relationship for sessions, we \"normalize\" the transaction dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = es.normalize_dataframe(\n",
    "    base_dataframe_name=\"transactions\",\n",
    "    new_dataframe_name=\"sessions\",\n",
    "    index=\"session_id\",\n",
    "    make_time_index=\"session_start\",\n",
    "    additional_columns=[\n",
    "        \"device\",\n",
    "        \"customer_id\",\n",
    "        \"zip_code\",\n",
    "        \"session_start\",\n",
    "        \"join_date\",\n",
    "    ],\n",
    ")\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the output above, we see this method did two operations:\n",
    "\n",
    "1. It created a new dataframe called \"sessions\" based on the \"session_id\" and \"session_start\" columns in \"transactions\"\n",
    "2. It added a relationship connecting \"transactions\" and \"sessions\"\n",
    "\n",
    "If we look at the schema from the transactions dataframe and the new sessions dataframe, we see two more operations that were performed automatically:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"transactions\"].ww.schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"sessions\"].ww.schema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. It removed \"device\", \"customer_id\", \"zip_code\" and \"join_date\" from \"transactions\" and created new columns in the sessions dataframe. This reduces redundant information, as those properties of a session don't change between transactions.\n",
    "2. It copied \"session_start\" into the new sessions dataframe and marked it as the time index column to indicate the beginning of a session. If the base dataframe has a time index and ``make_time_index`` is not set, ``normalize_dataframe`` will create a time index for the new dataframe. In this case it would create a new time index called \"first_transactions_time\" using the time of the first transaction of each session. If we don't want this time index to be created, we can set ``make_time_index=False``.\n",
    "\n",
    "If we look at the dataframes, we can see what ``normalize_dataframe`` did to the actual data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"sessions\"].head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"transactions\"].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To finish preparing this dataset, create a \"customers\" dataframe using the same method call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = es.normalize_dataframe(\n",
    "    base_dataframe_name=\"sessions\",\n",
    "    new_dataframe_name=\"customers\",\n",
    "    index=\"customer_id\",\n",
    "    make_time_index=\"join_date\",\n",
    "    additional_columns=[\"zip_code\", \"join_date\"],\n",
    ")\n",
    "\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the EntitySet\n",
    "\n",
    "Finally, we are ready to use this EntitySet with any functionality within Featuretools. For example, let's build a feature matrix for each product in our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name=\"products\")\n",
    "\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "As we can see, the features from DFS use the relational structure of our `EntitySet`. Therefore it is important to think carefully about the dataframes that we create."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}


================================================
FILE: docs/source/getting_started/woodwork_types.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b95b28c1",
   "metadata": {},
   "source": [
    "# Woodwork Typing in Featuretools\n",
    "\n",
    "Featuretools relies on having consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system that contained objects called Variables. Now and moving forward, Featuretools will use an external data typing library for its typing: [Woodwork](https://woodwork.alteryx.com/en/stable/index.html).\n",
    "\n",
    "Understanding the Woodwork types that exist and how Featuretools uses Woodwork's type system will allow users to:\n",
    "\n",
    "- build EntitySets that best represent their data\n",
    "- understand the possible input and return types for Featuretools' Primitives\n",
    "- understand what features will get generated from a given set of data and primitives\n",
    "\n",
    "Read the [Understanding Woodwork Logical Types and Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide for an in-depth walkthrough of the available Woodwork types that are outlined below.\n",
    "\n",
    "For users that are familiar with the old `Variable` objects, the [Transitioning to Featuretools Version 1.0](../resources/transition_to_ft_v1.0.ipynb) guide will be useful for converting Variable types to Woodwork types.\n",
    "\n",
    "## Physical Types \n",
    "Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You might also see the physical type for a column referred to as the column’s `dtype`.\n",
    "\n",
    "Knowing a Woodwork DataFrame's physical types is important because Pandas relies on these types when performing DataFrame operations. Each Woodwork `LogicalType` class has a single physical type associated with it.\n",
    "\n",
    "## Logical Types\n",
    "Logical types add additional information about how data should be interpreted or parsed beyond what can be contained in a physical type. In fact, multiple logical types have the same physical type, each imparting a different meaning that's not contained in the physical type alone.\n",
    "\n",
    "In Featuretools, a column's logical type informs how data is read into an EntitySet and how it gets used down the line in Deep Feature Synthesis.\n",
    "\n",
    "Woodwork provides many different logical types, which can be seen with the `list_logical_types` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "497712b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import featuretools as ft\n",
    "\n",
    "ft.list_logical_types()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfe99d0f",
   "metadata": {},
   "source": [
    "Featuretools will perform type inference to assign logical types to the data in EntitySets if none are provided, but it is also possible to specify which logical types should be set for any column (provided that the data in that column is compatible with the logical type).\n",
    "\n",
    "To learn more about how logical types are used in EntitySets, see the [Creating EntitySets](using_entitysets.ipynb) guide.\n",
    "\n",
    "To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on [working with Logical Types](https://woodwork.alteryx.com/en/stable/guides/working_with_types_and_tags.html#Working-with-Logical-Types). \n",
    "\n",
    "## Semantic Tags\n",
    "Semantic tags provide additional information to columns about the meaning or potential uses of data. Columns can have many or no semantic tags. Some tags are added by Woodwork, some are added by Featuretools, and users can add additional tags as they see fit.\n",
    "\n",
    "To learn more about setting semantic tags directly on a DataFrame, see the Woodwork guide on [working with Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/working_with_types_and_tags.html#Working-with-Semantic-Tags). \n",
    "\n",
    "### Woodwork-defined Semantic Tags\n",
    "\n",
    "Woodwork will add certain semantic tags to columns at initialization. These can be standard tags that may be associated with different sets of logical types or index tags. There are also tags that users can add to confer a suggested meaning to columns in Woodwork.\n",
    "\n",
    "To get a list of these tags, you can use the `list_semantic_tags` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11f25bd9",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.list_semantic_tags()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29222810",
   "metadata": {},
   "source": [
    "Above we see the semantic tags that are defined within Woodwork. These tags inform how Featuretools is able to interpret data, an example of which can be seen in the `Age` primitive, which requires that the `date_of_birth` semantic tag be present on a column.\n",
    "\n",
    "The `date_of_birth` tag will not get automatically added by Woodwork, so in order for Featuretools to be able to use the `Age` primitive, the `date_of_birth` tag must be manually added to any columns to which it applies.\n",
    "\n",
    "### Featuretools-defined Semantic Tags\n",
    "\n",
    "Just like Woodwork specifies semantic tags internally, Featuretools also defines a few tags of its own that allow the full set of Features to be generated. These tags have specific meanings when they are present on a column.\n",
    "\n",
    "- `'last_time_index'` - added by Featuretools to the last time index column of a DataFrame. Indicates that this column has been created by Featuretools.\n",
    "- `'foreign_key'` - used to indicate that this column is the child column of a relationship, meaning that this column is related to a corresponding index column of another dataframe in the EntitySet.\n",
    "\n",
    "\n",
    "## Woodwork Throughout Featuretools\n",
    "\n",
    "Now that we've described the elements that make up Woodwork's type system, let's see them in action in Featuretools.\n",
    "\n",
    "### Woodwork in EntitySets\n",
    "For more information on building EntitySets using Woodwork, see the [EntitySet guide](using_entitysets.ipynb).\n",
    "\n",
    "Let's look at the Woodwork typing information as it's stored in a demo EntitySet of retail data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd9c1ec9",
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.demo.load_retail()\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "267880c4",
   "metadata": {},
   "source": [
    "Woodwork typing information is not stored in the EntitySet object, but rather is stored in the individual DataFrames that make up the EntitySet. To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet, and then access the Woodwork information via the `ww` namespace:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa1966fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = es[\"products\"]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "164b1138",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.ww"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bffac54",
   "metadata": {},
   "source": [
    "Notice how the three columns showing this DataFrame's typing information are the three elements of typing information outlined at the beginning of this guide. To reiterate: By defining physical types, logical types, and semantic tags for each column in a DataFrame, we've defined a DataFrame's Woodwork schema, and with it, we can gain an understanding of the contents of each column.\n",
    "\n",
    "This column-specific typing information that exists for every column in every DataFrame in an EntitySet is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.\n",
    "\n",
    "### Woodwork in DFS\n",
    "As the units of computation in Featuretools, Primitives need to be able to specify the input types that they allow as well as have a predictable return type. For an in-depth explanation of Primitives in Featuretools, see the [Feature Primitives](primitives.ipynb) guide. Here, we'll look at how the Woodwork types come together into a `ColumnSchema` object to describe Primitive input and return types.\n",
    "\n",
    "Below is a Woodwork `ColumnSchema` that we've obtained from the `'product_id'` column in the `products` DataFrame in the retail EntitySet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "349e5274",
   "metadata": {},
   "outputs": [],
   "source": [
    "products_df = es[\"products\"]\n",
    "product_ids_series = products_df.ww[\"product_id\"]\n",
    "column_schema = product_ids_series.ww.schema\n",
    "column_schema"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e8c0ccf",
   "metadata": {},
   "source": [
    "This combination of logical type and semantic tag typing information is a `ColumnSchema`. In the case above, the `ColumnSchema` describes the **type definition** for a single column of data. \n",
    "\n",
    "Notice that there is no physical type in a `ColumnSchema`. This is because a `ColumnSchema` is a collection of Woodwork types that doesn't have any data tied to it and therefore has no physical representation. Because a `ColumnSchema` object is not tied to any data, it can also be used to describe a **type space** into which other columns may or may not fall.\n",
    "\n",
    "This flexibility of the `ColumnSchema` class allows `ColumnSchema` objects to be used both as type definitions for every column in an EntitySet as well as input and return type spaces for every Primitive in Featuretools.\n",
    "\n",
    "Let's look at a different column in a different DataFrame to see how this works:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3bb3ffe",
   "metadata": {},
   "outputs": [],
   "source": [
    "order_products_df = es[\"order_products\"]\n",
    "order_products_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1aae3378",
   "metadata": {},
   "outputs": [],
   "source": [
    "quantity_series = order_products_df.ww[\"quantity\"]\n",
    "column_schema = quantity_series.ww.schema\n",
    "column_schema"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f067db9a",
   "metadata": {},
   "source": [
    "The `ColumnSchema` above has been pulled from the `'quantity'` column in the `order_products` DataFrame in the retail EntitySet. This is a **type definition**. \n",
    "\n",
    "If we look at the Woodwork typing information for the `order_products` DataFrame, we can see that there are several columns that will have similar `ColumnSchema` type definitions. If we wanted to describe subsets of those columns, we could define several `ColumnSchema` **type spaces**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc2bfae6",
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"order_products\"].ww"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73257dcf",
   "metadata": {},
   "source": [
    "Below are several `ColumnSchema`s that all would include our `quantity` column, but each of them describes a different type space. These `ColumnSchema`s get more restrictive as we go down:\n",
    "\n",
    "##### Entire DataFrame\n",
    "No restrictions have been placed; any column falls into this definition. This would include the whole DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6614c98",
   "metadata": {},
   "outputs": [],
   "source": [
    "from woodwork.column_schema import ColumnSchema\n",
    "\n",
    "ColumnSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "299fc7d2",
   "metadata": {},
   "source": [
    "An example of a Primitive with this `ColumnSchema` as its input type is the `IsNull` transform primitive.\n",
    "\n",
    "##### By Semantic Tag\n",
    "Only columns with the `numeric` tag apply. This can include Double, Integer, and Age logical type columns as well. It will not include the `index` column which, despite containing integers, has had its standard tags replaced by the `'index'` tag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16c1a5a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "ColumnSchema(semantic_tags={\"numeric\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0932d05d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = es[\"order_products\"].ww.select(include=\"numeric\")\n",
    "df.ww"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5ec95c8",
   "metadata": {},
   "source": [
    "An example of a Primitive with this `ColumnSchema` as its input type is the `Mean` aggregation primitive.\n",
    "\n",
    "##### By Logical Type\n",
    "Only columns with a logical type of `Integer` are included in this definition. It does not require the `numeric` tag, so an index column (which has its standard tags removed) would still apply."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79bd3d4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from woodwork.logical_types import Integer\n",
    "\n",
    "ColumnSchema(logical_type=Integer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e905229e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = es[\"order_products\"].ww.select(include=\"Integer\")\n",
    "df.ww"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f752200",
   "metadata": {},
   "source": [
    "##### By Logical Type and Semantic Tag\n",
    "The column must have logical type `Integer` and have the `numeric` semantic tag, excluding index columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6da51b75",
   "metadata": {},
   "outputs": [],
   "source": [
    "ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a96d92f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = es[\"order_products\"].ww.select(include=\"numeric\")\n",
    "df = df.ww.select(include=\"Integer\")\n",
    "df.ww"
   ]
  },
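  {
   "cell_type": "markdown",
   "id": "5d3c9a10",
   "metadata": {},
   "source": [
    "As a brief illustration of these type spaces on real primitives, the sketch below inspects the `input_types` and `return_type` attributes of the `IsNull` and `Mean` primitives mentioned earlier. It only reads class attributes and assumes both primitives can be imported from `featuretools.primitives`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b41e6f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from featuretools.primitives import IsNull, Mean\n",
    "\n",
    "# Each primitive declares the inputs it accepts and the output it\n",
    "# produces as ColumnSchema objects.\n",
    "for primitive in (IsNull, Mean):\n",
    "    print(primitive.name, primitive.input_types, primitive.return_type)"
   ]
  },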
  {
   "cell_type": "markdown",
   "id": "71e0359b",
   "metadata": {},
   "source": [
    "In this way, a `ColumnSchema` can define a type space under which columns in a Woodwork DataFrame can fall. This is how Featuretools determines which columns in a DataFrame are valid for a Primitive in building Features during DFS.\n",
    "\n",
    "Each Primitive has `input_types` and a `return_type` that are described by a Woodwork `ColumnSchema`. Every DataFrame in an EntitySet has Woodwork initialized on it. This means that when an EntitySet is passed into DFS, Featuretools can select the relevant columns in the DataFrame that are valid for the Primitive's `input_types`. We then get a Feature that has a `column_schema` property that indicates what that Feature's typing definition is in a way that lets DFS stack features on top of one another.\n",
    "\n",
    "In this way, Featuretools is able to leverage the base unit of Woodwork typing information, the `ColumnSchema`, and use it in concert with an EntitySet of Woodwork DataFrames in order to build Features with Deep Feature Synthesis."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/guides/advanced_custom_primitives.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Advanced Custom Primitives Guide"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "import numpy as np\n",
    "from woodwork.column_schema import ColumnSchema\n",
    "from woodwork.logical_types import Datetime, NaturalLanguage\n",
    "\n",
    "import featuretools as ft\n",
    "from featuretools.primitives import TransformPrimitive\n",
    "from featuretools.tests.testing_utils import make_ecommerce_entityset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Primitives with Additional Arguments\n",
    "\n",
    "Some features require more advanced calculations than others. Advanced features usually entail additional arguments to help output the desired value. With custom primitives, you can use primitive arguments to help you create advanced features.\n",
    "\n",
    "### String Count Example\n",
    "\n",
    "In this example, you will learn how to make custom primitives that take in additional arguments. You will create a primitive to count the number of times a specific string value occurs inside a text.\n",
    "\n",
    "First, derive a new transform primitive class using `TransformPrimitive` as a base. The primitive will take in a text column as the input and return a numeric column as the output, so set the input type to a Woodwork `ColumnSchema` with logical type `NaturalLanguage` and the return type to a Woodwork `ColumnSchema` with the semantic tag `'numeric'`. The specific string value is the additional argument, so define it as a *keyword* argument inside `__init__`. Then, override `get_function` to return a primitive function that will calculate the feature.\n",
    "\n",
    "Featuretools' primitives use Woodwork's `ColumnSchema` to control the input and return types of columns for the primitive. For more information about using the Woodwork typing system in Featuretools, see the [Woodwork Typing in Featuretools](../getting_started/woodwork_types.ipynb) guide."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class StringCount(TransformPrimitive):\n",
    "    \"\"\"Count the number of times the string value occurs.\"\"\"\n",
    "\n",
    "    name = \"string_count\"\n",
    "    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n",
    "    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n",
    "\n",
    "    def __init__(self, string=None):\n",
    "        self.string = string\n",
    "\n",
    "    def get_function(self):\n",
    "        def string_count(column):\n",
    "            assert self.string is not None, \"string to count needs to be defined\"\n",
    "            # this is a naive implementation used for clarity\n",
    "            counts = [text.lower().count(self.string) for text in column]\n",
    "            return counts\n",
    "\n",
    "        return string_count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you have a primitive that is reusable for different string values. For example, you can create features based on the number of times the word \"the\" appears in a text. Create an instance of the primitive where the string value is \"the\" and pass the primitive into DFS to generate the features. The feature name will automatically reflect the string value of the primitive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = make_ecommerce_entityset()\n",
    "\n",
    "feature_matrix, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"sessions\",\n",
    "    agg_primitives=[\"sum\", \"mean\", \"std\"],\n",
    "    trans_primitives=[StringCount(string=\"the\")],\n",
    ")\n",
    "\n",
    "feature_matrix[\n",
    "    [\n",
    "        \"STD(log.STRING_COUNT(comments, string=the))\",\n",
    "        \"SUM(log.STRING_COUNT(comments, string=the))\",\n",
    "        \"MEAN(log.STRING_COUNT(comments, string=the))\",\n",
    "    ]\n",
    "]"
   ]
  },
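  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the primitive's logic outside of DFS, you can call the function returned by `get_function` directly. The following is a minimal sketch that reuses the `StringCount` class defined above; the `sample_text` series is made up for illustration and is not part of the demo EntitySet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample text used only to exercise the primitive function.\n",
    "sample_text = pd.Series([\"The cat sat on the mat\", \"Nothing to count here\"])\n",
    "\n",
    "# get_function returns the plain Python function that DFS would apply.\n",
    "string_count = StringCount(string=\"the\").get_function()\n",
    "string_count(sample_text)"
   ]
  },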
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Features with Multiple Outputs\n",
    "\n",
    "Some calculations output more than a single value. With custom primitives, you can make the most of these calculations by creating a feature for each output value.\n",
    "\n",
    "### Case Count Example\n",
    "\n",
    "In this example, you will learn how to make custom primitives that output multiple features. You will create a primitive that outputs the count of upper case and lower case letters of a text.\n",
    "\n",
    "First, derive a new transform primitive class using `TransformPrimitive` as a base. The primitive will take in a text column as the input and return two numeric columns as the output, so set the input type to a Woodwork `ColumnSchema` with logical type `NaturalLanguage` and the return type to a Woodwork `ColumnSchema` with semantic tag `'numeric'`. Since this primitive returns two columns, also set `number_output_features` to two. Then, override `get_function` to return a primitive function that will calculate the feature and return a list of columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class CaseCount(TransformPrimitive):\n",
    "    \"\"\"Return the count of upper case and lower case letters of a text.\"\"\"\n",
    "\n",
    "    name = \"case_count\"\n",
    "    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n",
    "    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n",
    "    number_output_features = 2\n",
    "\n",
    "    def get_function(self):\n",
    "        def case_count(array):\n",
    "            # this is a naive implementation used for clarity\n",
    "            upper = np.array([len(re.findall(\"[A-Z]\", i)) for i in array])\n",
    "            lower = np.array([len(re.findall(\"[a-z]\", i)) for i in array])\n",
    "            return upper, lower\n",
    "\n",
    "        return case_count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you have a primitive that outputs two columns. One column contains the count for the upper case letters. The other column contains the count for the lower case letters. Pass the primitive into DFS to generate features. By default, the feature name will reflect the index of the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"sessions\",\n",
    "    agg_primitives=[],\n",
    "    trans_primitives=[CaseCount],\n",
    ")\n",
    "\n",
    "feature_matrix[\n",
    "    [\n",
    "        \"customers.CASE_COUNT(favorite_quote)[0]\",\n",
    "        \"customers.CASE_COUNT(favorite_quote)[1]\",\n",
    "    ]\n",
    "]"
   ]
  },
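  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calling the primitive function directly shows how a multi-output primitive returns one array per output feature. This is a minimal sketch that reuses the `CaseCount` class defined above; the `sample_text` series is made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample text used only to exercise the primitive function.\n",
    "sample_text = pd.Series([\"Hello World\", \"featuretools\"])\n",
    "\n",
    "# The function returns two arrays, matching number_output_features = 2.\n",
    "case_count = CaseCount().get_function()\n",
    "upper, lower = case_count(sample_text)\n",
    "upper, lower"
   ]
  },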
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Custom Naming for Multiple Outputs\n",
    "\n",
    "When you create a primitive that outputs multiple features, you can also define custom naming for each of those features.\n",
    "\n",
    "### Hourly Sine and Cosine Example\n",
    "\n",
    "In this example, you will learn how to apply custom naming for multiple outputs. You will create a primitive that outputs the sine and cosine of the hour.\n",
    "\n",
    "First, derive a new transform primitive class using `TransformPrimitive` as a base. The primitive will take in the time index as the input and return two numeric columns as the output. Set the input type to a Woodwork `ColumnSchema` with a logical type of `Datetime` and the semantic tag `'time_index'`. Next, set the return type to a Woodwork `ColumnSchema` with semantic tag `'numeric'` and set `number_output_features` to two. Then, override `get_function` to return a primitive function that will calculate the feature and return a list of columns. Also, override `generate_names` to return a list of the feature names that you define."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class HourlySineAndCosine(TransformPrimitive):\n",
    "    \"\"\"Returns the sine and cosine of the hour.\"\"\"\n",
    "\n",
    "    name = \"hourly_sine_and_cosine\"\n",
    "    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n",
    "    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n",
    "\n",
    "    number_output_features = 2\n",
    "\n",
    "    def get_function(self):\n",
    "        def hourly_sine_and_cosine(column):\n",
    "            sine = np.sin(column.dt.hour)\n",
    "            cosine = np.cos(column.dt.hour)\n",
    "            return sine, cosine\n",
    "\n",
    "        return hourly_sine_and_cosine\n",
    "\n",
    "    def generate_names(self, base_feature_names):\n",
    "        name = self.generate_name(base_feature_names)\n",
    "        return f\"{name}[sine]\", f\"{name}[cosine]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you have a primitive that outputs two columns. One column contains the sine of the hour. The other column contains the cosine of the hour. Pass the primitive into DFS to generate features. The feature name will reflect the custom naming you defined."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"log\",\n",
    "    agg_primitives=[],\n",
    "    trans_primitives=[HourlySineAndCosine],\n",
    ")\n",
    "\n",
    "feature_matrix.head()[\n",
    "    [\n",
    "        \"HOURLY_SINE_AND_COSINE(datetime)[sine]\",\n",
    "        \"HOURLY_SINE_AND_COSINE(datetime)[cosine]\",\n",
    "    ]\n",
    "]"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}


================================================
FILE: docs/source/guides/deployment.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "92a0dab5",
   "metadata": {},
   "source": [
    "# Deployment\n",
    "\n",
    "Deployment of machine learning models requires repeating feature engineering steps on new data. In some cases, these steps need to be performed in near real-time. Featuretools has capabilities to ease the deployment of feature engineering.\n",
    "\n",
    "## Saving Features\n",
    "\n",
    "First, let's generate some training and test data in the same format. We use a different random seed to generate different data for the test set."
   ]
  },
  {
   "cell_type": "raw",
   "id": "129c8011",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note ::\n",
    "\n",
    "    Features saved in one version of Featuretools are not guaranteed to load in another. This means the features might need to be re-created after upgrading Featuretools."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01c19e97",
   "metadata": {},
   "outputs": [],
   "source": [
    "import featuretools as ft\n",
    "\n",
    "es_train = ft.demo.load_mock_customer(return_entityset=True)\n",
    "es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "042f8c02",
   "metadata": {},
   "source": [
    "Now let's build some feature definitions using DFS. Because we have categorical features, we also encode them with one hot encoding based on the values in the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6bcc87a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es_train, target_dataframe_name=\"customers\"\n",
    ")\n",
    "\n",
    "feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)\n",
    "feature_matrix_enc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03ffe00a",
   "metadata": {},
   "source": [
    "Now, we can use [featuretools.save_features](../generated/featuretools.save_features.rst#featuretools.save_features) to save a list of features to a JSON file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79d4ff65",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.save_features(features_enc, \"feature_definitions.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67723f25",
   "metadata": {},
   "source": [
    "## Calculating Feature Matrix for New Data\n",
    "\n",
    "We can use [featuretools.load_features](../generated/featuretools.load_features.rst#featuretools.load_features) to read in a list of saved features to calculate for our new EntitySet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8f728c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "saved_features = ft.load_features(\"feature_definitions.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1624ea4d",
   "metadata": {},
   "source": [
    "After we load the features back in, we can calculate the feature matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f37f61e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix = ft.calculate_feature_matrix(saved_features, es_test)\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9f39b54",
   "metadata": {},
   "source": [
    "As you can see above, we have the exact same features as before, but calculated using the test data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42a47ad9",
   "metadata": {},
   "source": [
    "## Exporting Feature Matrix\n",
    "\n",
    "### Save as csv\n",
    "\n",
    "The feature matrix is a pandas DataFrame that we can save to disk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "570c69fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix.to_csv(\"feature_matrix.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0fc5342",
   "metadata": {},
   "source": [
    "We can also read it back in as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "297db0a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "saved_fm = pd.read_csv(\"feature_matrix.csv\", index_col=\"customer_id\")\n",
    "saved_fm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b84dc51",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.remove(\"feature_definitions.json\")\n",
    "os.remove(\"feature_matrix.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/guides/feature_descriptions.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1557274d",
   "metadata": {},
   "source": [
    "# Generating Feature Descriptions\n",
    "\n",
    "As features become more complicated, their names can become harder to understand. Both the [describe_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.graph_feature.html) function and the [graph_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.describe_feature.html) function can help explain what a feature is and the steps Featuretools took to generate it. Additionally, the ``describe_feature`` function can be augmented by providing custom definitions and templates to improve the resulting descriptions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdb8b3eb",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "import featuretools as ft\n",
    "\n",
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "\n",
    "feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mean\", \"sum\", \"mode\", \"n_most_common\"],\n",
    "    trans_primitives=[\"month\", \"hour\"],\n",
    "    max_depth=2,\n",
    "    features_only=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01f8209c",
   "metadata": {},
   "source": [
    "By default, ``describe_feature`` uses the existing column and DataFrame names and the default primitive description templates to generate feature descriptions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35b86722",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_defs[9]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e24bee8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature_defs[9])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5402e848",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_defs[14]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac22c09c",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature_defs[14])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff9b7b35",
   "metadata": {},
   "source": [
    "## Improving Descriptions\n",
    "\n",
    "While the default descriptions can be helpful, they can also be further improved by providing custom definitions of columns and features, and by providing alternative templates for primitive descriptions. \n",
    "\n",
    "#### Feature Descriptions\n",
    "Custom feature definitions will get used in the description in place of the automatically generated description. This can be used to better explain what a `ColumnSchema` or feature is, or to provide descriptions that take advantage of a user's existing knowledge about the data or domain. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33b2f8e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_descriptions = {\"customers: join_date\": \"the date the customer joined\"}\n",
    "\n",
    "ft.describe_feature(feature_defs[9], feature_descriptions=feature_descriptions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "218147f4",
   "metadata": {},
   "source": [
    "For example, the above replaces the column name, ``\"join_date\"``, with a more descriptive definition of what that column represents in the dataset. Descriptions can also be set directly on a column in a DataFrame by going through the Woodwork typing information to access the ``description`` attribute present on each `ColumnSchema`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "597e20a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "join_date_column_schema = es[\"customers\"].ww.columns[\"join_date\"]\n",
    "join_date_column_schema.description = \"the date the customer joined\"\n",
    "\n",
    "es[\"customers\"].ww.columns[\"join_date\"].description"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c013615",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature = ft.TransformFeature(es[\"customers\"].ww[\"join_date\"], ft.primitives.Hour)\n",
    "feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03e828b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature)"
   ]
  },
  {
   "cell_type": "raw",
   "id": "689cbd98",
   "metadata": {},
   "source": [
    ".. note::\n",
    "\n",
    "    When setting a description on a column in a DataFrame as described above, be careful to avoid setting the description via ``df.ww[col_name].ww.description``. The use of ``df.ww[col_name]`` creates an entirely new Series object that is not related to the EntitySet from which feature descriptions are built. Therefore, setting the description in any way other than going through the ``columns`` attribute will not set the column's description in a way that will be propogated to the feature description. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10e779f5",
   "metadata": {},
   "source": [
    "Descriptions must be set for a column in a DataFrame before the feature is created in order for descriptions to propagate. Note that if a description is both set directly on a column and passed to ``describe_feature`` with ``feature_descriptions``, the description in the `feature_descriptions` parameter will take presedence.\n",
    "\n",
    "Feature descriptions can also be provided for generated features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d1f8667",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_descriptions = {\n",
    "    \"sessions: SUM(transactions.amount)\": \"the total transaction amount for a session\"\n",
    "}\n",
    "\n",
    "feature_defs[14]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b90b8e4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature_defs[14], feature_descriptions=feature_descriptions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83217b19",
   "metadata": {},
   "source": [
    "Here, we create and pass in a custom description of the intermediate feature ``SUM(transactions.amount)``. The description for ``MEAN(sessions.SUM(transactions.amount))``, which is built on top of ``SUM(transactions.amount)``, uses the custom description in place of the automatically generated one. Feature descriptions can be passed in as a dictionary that maps the custom descriptions to either the feature object itself or the unique feature name in the form ``\"[dataframe_name]: [feature_name]\"``, as shown above.\n",
    "\n",
    "#### Primitive Templates\n",
    "Primitives descriptions are generated using primitive templates. By default, these are defined using the ``description_template`` attribute on the primitive. Primitives without a template default to using the ``name`` attribute of the primitive if it is defined, or the class name if it is not. Primitive description templates are string templates that take input feature descriptions as the positional arguments. These can be overwritten by mapping primitive instances or primitive names to custom templates and passing them into ``describe_feature`` through the ``primitive_templates`` argument. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50f1bfb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "primitive_templates = {\"sum\": \"the total of {}\"}\n",
    "\n",
    "feature_defs[6]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1fb53a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b9cceca",
   "metadata": {},
   "source": [
    "In this example, we override the default template of ``'the sum of {}'`` with our custom template ``'the total of {}'``. The description uses our custom template instead of the default.\n",
    "\n",
    "Multi-output primitives can use a list of primitive description templates to differentiate between the generic multi-output feature description and the feature slice descriptions. The first primitive template is always the generic overall feature. If only one other template is provided, it is used as the template for all slices. The slice number converted to the \"nth\" form is available through the ``nth_slice`` keyword."
   ]
  },
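  {
   "cell_type": "markdown",
   "id": "7c2e9a41",
   "metadata": {},
   "source": [
    "As a rough sketch of how such a template gets filled in, assuming templates behave like ordinary Python format strings: the input feature description goes into the positional ``{}`` slot and the slice position goes into the ``nth_slice`` keyword slot. The ``fill_template`` helper and the sample description below are hypothetical and are not part of Featuretools; they only illustrate the string formatting involved.\n",
    "\n",
    "```python\n",
    "def fill_template(template, input_description, nth_slice=None):\n",
    "    # Hypothetical helper, not a Featuretools function: the input feature\n",
    "    # description fills the positional {} slot, and the slice position\n",
    "    # (already in nth form, e.g. 1st) fills the {nth_slice} keyword slot.\n",
    "    # Keyword arguments a template does not use are simply ignored.\n",
    "    return template.format(input_description, nth_slice=nth_slice)\n",
    "\n",
    "\n",
    "# Generic description of the whole multi-output feature.\n",
    "generic = fill_template('the 3 most common elements of {}', 'the product_id')\n",
    "\n",
    "# Description of a single slice, using the shared slice template.\n",
    "first_slice = fill_template(\n",
    "    'the {nth_slice} most common element of {}', 'the product_id', nth_slice='1st'\n",
    ")\n",
    "```\n",
    "\n",
    "Under these assumptions a single slice template can serve every slice, because only the ``nth_slice`` value changes from one slice to the next."
   ]
  },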
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15ed472c",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature = feature_defs[5]\n",
    "feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54a5a6fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "primitive_templates = {\n",
    "    \"n_most_common\": [\n",
    "        \"the 3 most common elements of {}\",  # generic multi-output feature\n",
    "        \"the {nth_slice} most common element of {}\",\n",
    "    ]\n",
    "}  # template for each slice\n",
    "\n",
    "ft.describe_feature(feature, primitive_templates=primitive_templates)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49aae7d2",
   "metadata": {},
   "source": [
    "Notice how the multi-output feature uses the first template for its description. Each slice of this feature will use the second slice template:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1bd3a3cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature[0], primitive_templates=primitive_templates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "607299ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature[1], primitive_templates=primitive_templates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30f4235f",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature[2], primitive_templates=primitive_templates)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17953d54",
   "metadata": {},
   "source": [
    "Alternatively, instead of supplying a single template for all slices, templates can be provided for each slice to further customize the output. Note that in this case, each slice must get its own template."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bad05646",
   "metadata": {},
   "outputs": [],
   "source": [
    "primitive_templates = {\n",
    "    \"n_most_common\": [\n",
    "        \"the 3 most common elements of {}\",\n",
    "        \"the most common element of {}\",\n",
    "        \"the second most common element of {}\",\n",
    "        \"the third most common element of {}\",\n",
    "    ]\n",
    "}\n",
    "\n",
    "ft.describe_feature(feature, primitive_templates=primitive_templates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdad1868",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature[0], primitive_templates=primitive_templates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90a85bd0",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature[1], primitive_templates=primitive_templates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b63d47a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature[2], primitive_templates=primitive_templates)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1942ea49",
   "metadata": {},
   "source": [
    "Custom feature descriptions and primitive templates can also be seperately defined in a JSON file and passed to the ``describe_feature`` function using the ``metadata_file`` keyword argument. Descriptions passed in directly through the ``feature_descriptions`` and ``primitive_templates`` keyword arguments will take precedence over any descriptions provided in the JSON metadata file."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/guides/feature_selection.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Selection\n",
    "\n",
    "Featuretools provides users with the ability to remove features that are unlikely to be useful in building an effective machine learning model. Reducing the number of features in the feature matrix can both produce better results in the model as well as reduce the computational cost involved in prediction.\n",
    "\n",
    "Featuretools enables users to perform feature selection on the results of Deep Feature Synthesis with three functions:\n",
    "\n",
    "- `ft.selection.remove_highly_null_features`\n",
    "- `ft.selection.remove_single_value_features`\n",
    "- `ft.selection.remove_highly_correlated_features`\n",
    "\n",
    "We will describe each of these functions in depth, but first we must create an entity set with which we can run `ft.dfs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import featuretools as ft\n",
    "from featuretools.demo.flight import load_flight\n",
    "from featuretools.selection import (\n",
    "    remove_highly_correlated_features,\n",
    "    remove_highly_null_features,\n",
    "    remove_single_value_features,\n",
    ")\n",
    "\n",
    "es = load_flight(nrows=50)\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Remove Highly Null Features\n",
    "\n",
    "We might have a dataset with columns that have many null values. Deep Feature Synthesis might build features off of those null columns, creating even more highly null features. In this case, we might want to remove any features whose null values pass a certain threshold. Below is our feature matrix with such a case:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fm, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"trip_logs\",\n",
    "    cutoff_time=pd.DataFrame(\n",
    "        {\n",
    "            \"trip_log_id\": [30, 1, 2, 3, 4],\n",
    "            \"time\": pd.to_datetime([\"2016-09-22 00:00:00\"] * 5),\n",
    "        }\n",
    "    ),\n",
    "    trans_primitives=[],\n",
    "    agg_primitives=[],\n",
    "    max_depth=2,\n",
    ")\n",
    "fm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We look at the above feature matrix and decide to remove the highly null features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.selection.remove_highly_null_features(fm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that calling `remove_highly_null_features` didn't remove every feature that contains a null value. By default, we only remove features where the percentage of null values in the calculated feature matrix is above 95%. If we want to lower that threshold, we can set the `pct_null_threshold` paramter ourselves."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "remove_highly_null_features(fm, pct_null_threshold=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Remove Single Value Features\n",
    "\n",
    "Another situation we might run into is one where our calculated features don't have any variance. In those cases, we are likely to want to remove the uninteresting features. For that, we use `remove_single_value_features`.\n",
    "\n",
    "Let's see what happens when we remove the single value features of the feature matrix below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fm"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note ::\n",
    "    A list of feature definitions such as those created by `dfs` can be provided to the feature selection functions.\n",
    "    Doing this will change the outputs to include an updated list of feature definitions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_fm, new_features = remove_single_value_features(fm, features=features)\n",
    "new_fm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the features definitions for the updated feature matrix, we can see that the features that were removed are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set(features) - set(new_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the function used as it is above, null values are not considered when counting a feature's unique values. If we'd like to consider `NaN` its own value, we can set `count_nan_as_value` to `True` and we'll see `flights.carrier` and `flights.flight_num` back in the matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_fm, new_features = remove_single_value_features(\n",
    "    fm, features=features, count_nan_as_value=True\n",
    ")\n",
    "new_fm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The features that were removed are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set(features) - set(new_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Remove Highly Correlated Features\n",
    "\n",
    "The last feature selection function we have allows us to remove features that would likely be redundant to the model we're attempting to build by considering the correlation between pairs of calculated features.\n",
    "\n",
    "When two features are determined to be highly correlated, we remove the more complex of the two. For example, say we have two features: `col` and `-(col)`.\n",
    "\n",
    "We can see that `-(col)` is just the negation of `col`, and so we can guess those features are going to be highly correlated. `-(col)` has has the `Negate` primitive applied to it, so it is more complex than the identity feature `col`. Therefore, if we only want one of `col` and `-(col)`, we should keep the identity feature. For features that don't have an obvious difference in complexity, we discard the feature that comes later in the feature matrix. \n",
    "\n",
    "Let's try this out on our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fm, features = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"trip_logs\",\n",
    "    trans_primitives=[\"negate\"],\n",
    "    agg_primitives=[],\n",
    "    max_depth=3,\n",
    ")\n",
    "fm.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we have some pretty clear correlations here between all the features and their negations.\n",
    "\n",
    "Now, using `remove_highly_correlated_features`, our default threshold for correlation is 95% correlated, and we get all of the obviously correlated features removed, leaving just the less complex features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_fm, new_features = remove_highly_correlated_features(fm, features=features)\n",
    "new_fm.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The features that were removed are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set(features) - set(new_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Change the correlation threshold\n",
    "\n",
    "We can lower the threshold at which to remove correlated features if we'd like to be more restrictive by using the `pct_corr_threshold` parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_fm, new_features = remove_highly_correlated_features(\n",
    "    fm, features=features, pct_corr_threshold=0.9\n",
    ")\n",
    "new_fm.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The features that were removed are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set(features) - set(new_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Check a Subset of Features\n",
    "\n",
    "If we only want to check a subset of features, we can set `features_to_check` to the list of features whose correlation we'd like to check, and no features outside of that list will be removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_fm, new_features = remove_highly_correlated_features(\n",
    "    fm,\n",
    "    features=features,\n",
    "    features_to_check=[\"air_time\", \"distance\", \"flights.distance_group\"],\n",
    ")\n",
    "new_fm.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The features that were removed are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set(features) - set(new_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Protect Features from Removal\n",
    "\n",
    "To protect specific features from being removed from the feature matrix, we can include a list of `features_to_keep`, and these features will not be removed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_fm, new_features = remove_highly_correlated_features(\n",
    "    fm,\n",
    "    features=features,\n",
    "    features_to_keep=[\"air_time\", \"distance\", \"flights.distance_group\"],\n",
    ")\n",
    "new_fm.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The features that were removed are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set(features) - set(new_features)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "interpreter": {
   "hash": "eadebc3a8a3dd54e52de25d3077ea0e41c7a462ff73c567da199d6de4c02ed7d"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: docs/source/guides/guides_index.rst
================================================
Guides
---------------

Guides on more advanced Featuretools functionality

.. toctree::
   :maxdepth: 1

   tuning_dfs
   specifying_primitive_options
   performance
   deployment
   advanced_custom_primitives
   feature_descriptions
   feature_selection
   time_series
   sql_database_integration


================================================
FILE: docs/source/guides/performance.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "raw",
   "id": "2c5291f3",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _performance:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9dab133a",
   "metadata": {},
   "source": [
    "# Improving Computational Performance\n",
    "\n",
    "Feature engineering is a computationally expensive task. While Featuretools comes with reasonable default settings for feature calculation, there are a number of built-in approaches to improve computational performance based on dataset and problem specific considerations.\n",
    "\n",
    "## Reduce number of unique cutoff times\n",
    "Each row in a feature matrix created by Featuretools is calculated at a specific cutoff time that represents the last point in time that data from any dataframe in an entityset can be used to calculate the feature. As a result, calculations incur an overhead in finding the subset of allowed data for each distinct time in the calculation."
   ]
  },
  {
   "cell_type": "raw",
   "id": "6ab1a83a",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note::\n",
    "\n",
    "    Featuretools is very precise in how it deals with time. For more information, see :doc:`/getting_started/handling_time`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "051fbaba",
   "metadata": {},
   "source": [
    "If there are many unique cutoff times, it is often worthwhile to figure out how to have fewer. This can be done manually by figuring out which unique times are necessary for the prediction problem or automatically using [approximate](../getting_started/handling_time.ipynb#Approximating-Features-by-Rounding-Cutoff-Times).\n",
    "\n",
    "## Parallel Feature Computation\n",
    "\n",
    "Computational performance can often be improved by parallelizing the feature calculation process. There are several different approaches that can be used to perform parallel feature computation with Featuretools. An overview of the most commonly used approaches is provided below."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b47e770f",
   "metadata": {},
   "source": [
    "\n",
    "### Simple Parallel Feature Computation\n",
    "If using a pandas `EntitySet`, Featuretools can optionally compute features on multiple cores. The simplest way to control the amount of parallelism is to specify the `n_jobs` parameter:\n",
    "\n",
    "```python3\n",
    "fm = ft.calculate_feature_matrix(features=features,\n",
    "                                 entityset=entityset,\n",
    "                                 cutoff_time=cutoff_time,\n",
    "                                 n_jobs=2,\n",
    "                                 verbose=True)\n",
    "```\n",
    "The above command will start 2 processes to compute chunks of the feature matrix in parallel. Each process receives its own copy of the entityset, so memory use will be proportional to the number of parallel processes. Because the entityset has to be copied to each process, there is overhead to perform this operation before calculation can begin. To avoid this overhead on successive calls to `calculate_feature_matrix`, read the section below on using a persistent cluster.\n",
    "\n",
    "#### Adjust chunk size\n",
    "By default, Featuretools calculates rows with the same cutoff time simultaneously. The *chunk_size* parameter limits the maximum number of rows that will be grouped and then calculated together. If calculation is done using parallel processing, the default chunk size is set to be `1 / n_jobs` to ensure the computation can be spread across available workers. Normally, this behavior works well, but if there are only a few unique cutoff times it can lead to higher peak memory usage (due to more intermediate calculations stored in memory) or limited parallelism (if the number of chunks is less than *n_jobs*).\n",
    "\n",
    "By setting `chunk_size`, we can limit the maximum number of rows in each group to specific number or a percentage of the overall data when calling `ft.dfs` or `ft.calculate_feature_matrix`:\n",
    "\n",
    "```python3\n",
    "# use maximum  100 rows per chunk\n",
    "feature_matrix, features_list = ft.dfs(entityset=es,\n",
    "                                       target_dataframe_name=\"customers\",\n",
    "                                       chunk_size=100)\n",
    "```\n",
    "\n",
    "We can also set chunk size to be a percentage of total rows:\n",
    "\n",
    "```python3\n",
    "# use maximum 5% of all rows per chunk\n",
    "feature_matrix, features_list = ft.dfs(entityset=es,\n",
    "                                       target_dataframe_name=\"customers\",\n",
    "                                       chunk_size=.05)\n",
    "```\n",
    "\n",
    "#### Using persistent cluster\n",
    "Behind the scenes, Featuretools uses [Dask's](http://dask.pydata.org/) distributed scheduler to implement multiprocessing. When you only specify the `n_jobs` parameter, a cluster will be created for that specific feature matrix calculation and destroyed once calculations have finished. A drawback of this is that each time a feature matrix is calculated, the entityset has to be transmitted to the workers again. To avoid this, we would like to reuse the same cluster between calls. The way to do this is by creating a cluster first and telling featuretools to use it with the `dask_kwargs` parameter:\n",
    "\n",
    "```python3\n",
    "import featuretools as ft\n",
    "from dask.distributed import LocalCluster\n",
    "\n",
    "cluster = LocalCluster()\n",
    "fm_1 = ft.calculate_feature_matrix(features=features_1,\n",
    "                                   entityset=entityset,\n",
    "                                   cutoff_time=cutoff_time,\n",
    "                                   dask_kwargs={'cluster': cluster},\n",
    "                                   verbose=True)\n",
    "```\n",
    "\n",
    "The 'cluster' value can either be the actual cluster object or a string of the address the cluster's scheduler can be reached at. The call below would also work. This second feature matrix calculation will not need to resend the entityset data to the workers because it has already been saved on the cluster.\n",
    "\n",
    "```python3\n",
    "fm_2 = ft.calculate_feature_matrix(features=features_2,\n",
    "                                   entityset=entityset,\n",
    "                                   cutoff_time=cutoff_time,\n",
    "                                   dask_kwargs={'cluster': cluster.scheduler.address},\n",
    "                                   verbose=True)\n",
    "```"
   ]
  },
  {
   "cell_type": "raw",
   "id": "57aaa835",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note::\n",
    "\n",
    "    When using a persistent cluster, Featuretools publishes a copy of the ``EntitySet`` to the cluster the first time it calculates a feature matrix. Based on the ``EntitySet``'s metadata the cluster will reuse it for successive computations. This means if two ``EntitySets`` have the same metadata but different row values (e.g. new data is added to the ``EntitySet``), Featuretools won’t recopy the second ``EntitySet`` in later calls. A simple way to avoid this scenario is to use a unique ``EntitySet`` id."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdecad1d",
   "metadata": {},
   "source": [
    "#### Using the distributed dashboard\n",
    "Dask.distributed has a web-based diagnostics dashboard that can be used to analyze the state of the workers and tasks. It can also be useful for tracking memory use or visualizing task run-times. An in-depth description of the web interface can be found [here](https://distributed.readthedocs.io/en/latest/web.html).\n",
    "\n",
    "![Distributed dashboard image](../_static/images/dashboard.png)\n",
    "\n",
    "The dashboard requires an additional python package, bokeh, to work. Once bokeh is installed, the web interface will be launched by default when a LocalCluster is created. The cluster created by featuretools when using `n_jobs` does not enable the web interface automatically. To do so, the port to launch the main web interface on must be specified in `dask_kwargs`:\n",
    "\n",
    "```python3\n",
    "fm = ft.calculate_feature_matrix(features=features,\n",
    "                                 entityset=entityset,\n",
    "                                 cutoff_time=cutoff_time,\n",
    "                                 n_jobs=2,\n",
    "                                 dask_kwargs={'diagnostics_port': 8787}\n",
    "                                 verbose=True)\n",
    "```\n",
    "\n",
    "### Parallel Computation by Partitioning Data\n",
    "As an alternative to Featuretools' parallelization, the data can be partitioned and the feature calculations run on multiple cores or a cluster using Dask or Apache Spark with PySpark. This approach may be necessary with a large pandas `EntitySet` because the current parallel implementation sends the entire `EntitySet` to each worker which may exhaust the worker memory. Dask and Spark allow Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "795cc323",
   "metadata": {},
   "source": [
    "When an entire dataset is not required to calculate the features for a given set of instances, we can split the data into independent partitions and calculate on each partition. For example, imagine we are calculating features for customers and the features are \"number of other customers in this zip code\" or \"average age of other customers in this zip code\". In this case, we can load in data partitioned by zip code. As long as we have all of the data for a zip code when calculating, we can calculate all features for a subset of customers.\n",
    "\n",
    "An example of this approach can be seen in the [Predict Next Purchase demo notebook](https://github.com/featuretools/predict_next_purchase). In this example, we partition data by customer and only load a fixed number of customers into memory at any given time. We implement this easily using [Dask](https://dask.pydata.org/), which could also be used to scale the computation to a cluster of computers. A framework like [Spark](https://spark.apache.org/) could be used similarly.\n",
    "\n",
    "An additional example of partitioning data to distribute on multiple cores or a cluster using Dask can be seen in the [Featuretools on Dask notebook](https://github.com/Featuretools/Automated-Manual-Comparison/blob/main/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb). This approach is detailed in the [Parallelizing Feature Engineering with Dask article](https://medium.com/feature-labs-engineering/scaling-featuretools-with-dask-ce46f9774c7d) on the Feature Labs engineering blog. Dask allows for simple scaling to multiple cores on a single computer or multiple machines on a cluster.\n",
    "\n",
    "For a similar partition and distribute implementation using Apache Spark with PySpark, refer to the [Feature Engineering on Spark notebook](https://github.com/Featuretools/predict-customer-churn/blob/main/churn/4.%20Feature%20Engineering%20on%20Spark.ipynb). This implementation shows how to carry out feature engineering on a cluster of EC2 instances using Spark as the distributed framework."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/guides/specifying_primitive_options.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ba92172a",
   "metadata": {},
   "source": [
    "# Specifying Primitive Options\n",
    "\n",
    "By default, DFS will apply primitives across all dataframes and columns. This behavior can be altered through a few different parameters. Dataframes and columns can be optionally ignored or included for an entire DFS run or on a per-primitive basis, enabling greater control over features and less run time overhead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "106d36a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import featuretools as ft\n",
    "from featuretools.tests.testing_utils import make_ecommerce_entityset\n",
    "\n",
    "es = make_ecommerce_entityset()\n",
    "\n",
    "features_list = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mode\"],\n",
    "    trans_primitives=[\"weekday\"],\n",
    "    features_only=True,\n",
    ")\n",
    "features_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29ae225d",
   "metadata": {},
   "source": [
    "## Specifying Options for an Entire Run\n",
    "\n",
    "The `ignore_dataframes` and `ignore_columns` parameters of DFS control dataframes and columns that should be ignored for all primitives. This is useful for ignoring columns or dataframes that don't relate to the problem or otherwise shouldn't be included in the DFS run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d481527",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ignore the 'log' and 'cohorts' dataframes entirely\n",
    "# ignore the 'birthday' column in 'customers' and the 'device_name' column in 'sessions'\n",
    "features_list = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mode\"],\n",
    "    trans_primitives=[\"weekday\"],\n",
    "    ignore_dataframes=[\"log\", \"cohorts\"],\n",
    "    ignore_columns={\"sessions\": [\"device_name\"], \"customers\": [\"birthday\"]},\n",
    "    features_only=True,\n",
    ")\n",
    "features_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a9bd7e2",
   "metadata": {},
   "source": [
    "DFS completely ignores the `log` and `cohorts` dataframes when creating features. It also ignores the columns `device_name` and `birthday` in `sessions` and `customers` respectively. However, both of these options can be overridden by individual primitive options in the `primitive_options` parameter.\n",
    "\n",
    "## Specifying for Individual Primitives\n",
    "Options for individual primitives or groups of primitives are set by the `primitive_options` parameter of DFS. This parameter maps any desired options to specific primitives. In the case of conflicting options, options set at this level will override options set at the entire DFS run level, and the include options will always take priority over their ignore counterparts.\n",
    "\n",
    "Using the string primitive name or the primitive type will apply the options to all primitives of the same name. You can also set options for a specific instance of a primitive by using the primitive instance as a key in the `primitive_options` dictionary. Note, however, that specifying options for a specific instance will result in that instance ignoring any options set for the generic primitive through options with the primitive name or class as the key. \n",
    "\n",
    "### Specifying Dataframes for Individual Primitives\n",
    "Which dataframes to include/ignore can also be specified for a single primitive or a group of primitives. Dataframes can be ignored using the `ignore_dataframes` option in `primitive_options`, while dataframes to explicitly include are set by the ``include_dataframes`` option. When ``include_dataframes`` is given, all dataframes not listed are ignored by the primitive. No columns from any excluded dataframe will be used to generate features with the given primitive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bcbf11a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ignore the 'cohorts' and 'log' dataframes, but only for the primitive 'mode'\n",
    "# include only the 'customers' dataframe for the primitives 'weekday' and 'day'\n",
    "features_list = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mode\"],\n",
    "    trans_primitives=[\"weekday\", \"day\"],\n",
    "    primitive_options={\n",
    "        \"mode\": {\"ignore_dataframes\": [\"cohorts\", \"log\"]},\n",
    "        (\"weekday\", \"day\"): {\"include_dataframes\": [\"customers\"]},\n",
    "    },\n",
    "    features_only=True,\n",
    ")\n",
    "features_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5cbbff0",
   "metadata": {},
   "source": [
    "In this example, DFS would only use the `customers` dataframe for both `weekday` and `day`, and would use all dataframes except `cohorts` and `log` for `mode`.\n",
    "\n",
    "### Specifying Columns for Individual Primitives\n",
    "\n",
    "Specific columns can also be explicitly included/ignored for a primitive or group of primitives. Columns to\n",
    "ignore is set by the `ignore_columns` option, while columns to include are set by `include_columns`. When the\n",
    "`include_columns` option is set, no other columns from that dataframe will be used to make features with the given primitive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9e42358",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Include the columns 'product_id' and 'zipcode', 'device_type', and 'cancel_reason' for 'mean'\n",
    "# Ignore the columns 'signup_date' and 'cancel_date' for 'weekday'\n",
    "features_list = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mode\"],\n",
    "    trans_primitives=[\"weekday\"],\n",
    "    primitive_options={\n",
    "        \"mode\": {\n",
    "            \"include_columns\": {\n",
    "                \"log\": [\"product_id\", \"zipcode\"],\n",
    "                \"sessions\": [\"device_type\"],\n",
    "                \"customers\": [\"cancel_reason\"],\n",
    "            }\n",
    "        },\n",
    "        \"weekday\": {\"ignore_columns\": {\"customers\": [\"signup_date\", \"cancel_date\"]}},\n",
    "    },\n",
    "    features_only=True,\n",
    ")\n",
    "features_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88ea7094",
   "metadata": {},
   "source": [
    "Here, `mode` will only use the columns `product_id` and `zipcode` from the dataframe `log`, `device_type`\n",
    "from the dataframe `sessions`, and `cancel_reason` from `customers`. For any other dataframe, `mode` will use all\n",
    "columns. The `weekday` primitive will use all columns in all dataframes except for `signup_date` and `cancel_date`\n",
    "from the `customers` dataframe.\n",
    "\n",
    "\n",
    "### Specifying GroupBy Options\n",
    "\n",
    "GroupBy Transform Primitives also have the additional options `include_groupby_dataframes`, `ignore_groupby_dataframes`, `include_groupby_columns`, and `ignore_groupby_columns`. These options are used to specify dataframes and columns to include/ignore as groupings for inputs. By default, DFS only groups by foreign key columns. Specifying `include_groupby_columns` overrides this default, and will only group by columns given. On the other hand, `ignore_groupby_columns` will continue to use only the foreign key columns, ignoring any columns specified that are also foreign key columns. Note that if including non-foreign key columns to group by, the included columns must be categorical columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c1046b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "features_list = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"log\",\n",
    "    agg_primitives=[],\n",
    "    trans_primitives=[],\n",
    "    groupby_trans_primitives=[\"cum_sum\", \"cum_count\"],\n",
    "    primitive_options={\n",
    "        \"cum_sum\": {\"ignore_groupby_columns\": {\"log\": [\"product_id\"]}},\n",
    "        \"cum_count\": {\n",
    "            \"include_groupby_columns\": {\"log\": [\"product_id\", \"priority_level\"]},\n",
    "            \"ignore_groupby_dataframes\": [\"sessions\"],\n",
    "        },\n",
    "    },\n",
    "    features_only=True,\n",
    ")\n",
    "features_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10616725",
   "metadata": {},
   "source": [
    "We ignore `product_id` as a groupby for `cum_sum` but still use any other foreign key columns in that or any other dataframe. For `cum_count`, we use only `product_id` and `priority_level` as groupbys. Note that `cum_sum` doesn't use\n",
    "`priority_level` because it's not a foreign key column, but we explicitly include it for `cum_count`. Finally, note that specifying groupby options doesn't affect what features the primitive is applied to. For example, `cum_count` ignores the dataframe `sessions` for groupbys, but the feature `<Feature: CUM_COUNT(sessions.device_name) by product_id>` is still made. The groupby is from the target dataframe `log`, so the feature is valid given the associated options. To ignore the `sessions` dataframe for `cum_count`,  the `ignore_dataframes` option for `cum_count` would need to include `sessions`.\n",
    "\n",
    "\n",
    "## Specifying for each Input for Multiple Input Primitives\n",
    "\n",
    "For primitives that take multiple columns as input, such as `Trend`, the above options can be specified for each input by passing them in as a list. If only one option dictionary is given, it is used for all inputs. The length of the list provided must match the number of inputs the primitive takes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e808749",
   "metadata": {},
   "outputs": [],
   "source": [
    "features_list = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"trend\"],\n",
    "    trans_primitives=[],\n",
    "    primitive_options={\n",
    "        \"trend\": [\n",
    "            {\"ignore_columns\": {\"log\": [\"value_many_nans\"]}},\n",
    "            {\"include_columns\": {\"customers\": [\"signup_date\"], \"log\": [\"datetime\"]}},\n",
    "        ]\n",
    "    },\n",
    "    features_only=True,\n",
    ")\n",
    "features_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53d5d207",
   "metadata": {},
   "source": [
    "Here, we pass in a list of primitive options for trend.  We ignore the column `value_many_nans` for the first input\n",
    "to `trend`, and include the column `signup_date` from `customers` for the second input."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/guides/sql_database_integration.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SQL Database Integration \n",
    "\n",
    "`featuretools_sql` is an add-on library that supports automatic `EntitySet` creation from a relational database.\n",
    "\n",
    "Currently, `featuretools_sql` is compatible with the following systems:\n",
    "\n",
    "* `MySQL` \n",
    "* `PostgreSQL`\n",
    "* `Snowflake`\n",
    "\n",
    "The `DBConnector` object exposed by the `featuretools_sql` library provides the interface to connecting to the DBMS."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installing featuretools_sql \n",
    "\n",
    "Install with pip\n",
    "\n",
    "```\n",
    "python -m pip install \"featuretools[sql]\" \n",
    "``` "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Connecting to your database instance \n",
    "\n",
    "Depending on your choice of DBMS, you may have to provide different pieces of information to the `DBConnector` object.\n",
    "\n",
    "If you want to connect to a `MySQL` instance, you must pass the string `\"mysql\"` into the `system_name` argument.\n",
    "\n",
    "If you want to connect to a `PostgreSQL` instance, you must pass the string `\"postgresql\"` into the `system_name` argument.\n",
    "\n",
    "If you want to connect to a `Snowflake` instance, you must pass the string `\"snowflake\"` into the `system_name` argument.\n",
    "\n",
    "Here is an example call to the constructor of the object, connecting to a `PostgreSQL` database:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python \n",
    "from featuretools_sql.connector import DBConnector\n",
    "\n",
    "connector_object = DBConnector(\n",
    "    system_name=\"postgresql\",\n",
    "    user=\"postgres\",\n",
    "    host=\"localhost\",\n",
    "    port=\"5432\",\n",
    "    database=\"postgres\",\n",
    "    schema=\"public\",\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the choice of RDBMS does affect the required arguments -- for example, if you were connecting to a `MySQL` instance, you would not need a `schema` argument.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Converting to an EntitySet \n",
    "\n",
    "You can call the `get_entityset` method to instruct the `DBConnector` object to build an EntitySet. \n",
    "\n",
    "This method will loop through all the tables in the database and copy them into dataframes. Then it will populate the relationships data structure. It will finally pass those two arguments into the EntitySet constructor in Featuretools, and return the object."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python \n",
    "es = connector_object.get_entityset()\n",
    "``` \n",
    "\n",
    "Optionally, you can pass in table names to the `select_only` parameter if you only want to include a subset of the tables in the database. \n",
    "\n",
    "```python \n",
    "es = connector_object.get_entityset(select_only=[\"Products\", \"Transactions\"])\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examining the EntitySet's member data \n",
    "\n",
    "You can examine the member data of the `DBConnector` object to ensure that it imported data correctly.\n",
    "\n",
    "To access the dataframes it imported, access the `.dataframes` attribute. To access the relationships data structure, access the `.relationships` attribute.\n",
    "\n",
    "If you would like to visualize the EntitySet as a graph, you can call `es.plot()`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calling DFS \n",
    "\n",
    "The EntitySet object is ready to be passed into Featuretools's `DFS` algorithm! Read more about `DFS` [here]([https://featuretools.alteryx.com/en/stable/getting_started/afe.html#running-dfs). "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.12 64-bit ('venv_x86')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "vscode": {
   "interpreter": {
    "hash": "3f6b062a214ec48d1657976024d6bc68979519d14a33afb6ad033fc2e4189514"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: docs/source/guides/time_series.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17f894b5",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import pandas as pd\n",
    "\n",
    "import featuretools as ft\n",
    "from featuretools.demo.weather import load_weather\n",
    "from featuretools.primitives import Lag, RollingMean, RollingMin"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8104f18",
   "metadata": {},
   "source": [
    "# Feature Engineering for Time Series Problems"
   ]
  },
  {
   "cell_type": "raw",
   "id": "9cd9cb82",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note::\n",
    "        This guide focuses on feature engineering for single-table time series problems; it does not cover how to handle temporal multi-table data for other machine learning problem types. A more general guide on handling time in Featuretools can be found `here <../getting_started/handling_time.ipynb>`_."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cf3cebc",
   "metadata": {},
   "source": [
    "Time series forecasting consists of predicting future values of a target using earlier observations. In datasets that are used in time series problems, there is an inherent temporal ordering to the data (determined by a time index), and  the sequential target values we're predicting are highly dependent on one another. Feature engineering for time series problems exploits the fact that more recent observations are more predictive than more distant ones.\n",
    "\n",
    "This guide will explore how to use Featuretools for automating feature engineering for univariate time series problems, or problems in which only the time index and target column are included.\n",
    " \n",
    "We'll be working with a temperature demo EntitySet that contains one DataFrame, `temperatures`. The `temperatures` dataframe contains the minimum daily temperatures that we will be predicting. In total, it has three columns: `id`, `Temp`, and `Date`. The `id` column is the index that is necessary for Featuretools' purposes. The other two are important for univariate time series problems: `Date` is our time index, and `Temp` is our target column. The engineered features will be built from these two columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "862e46da",
   "metadata": {},
   "outputs": [],
   "source": [
    "es = load_weather()\n",
    "\n",
    "es[\"temperatures\"].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90242e31",
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"temperatures\"][\"Temp\"].plot(ylabel=\"Temp (C)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "060eb035",
   "metadata": {},
   "source": [
    "## Understanding The Feature Engineering Window\n",
    "\n",
    "In multi-table datasets, a feature engineering window for a single row in the target DataFrame extends forward in time over observations in child DataFrames starting at the time index and ending when either the cutoff time or last time index is reached. \n",
    "\n",
    "![Multi Table Timeline](../_static/images/multi_table_FE_timeline.png)\n",
    "\n",
    "In single-table time series datasets, the feature engineering window for a single value extends backwards in time within the same column. Because of this, the concepts of cutoff time and last time index are not relevant in the same way.\n",
    "\n",
    "For example: The cutoff time for a single-table time series dataset would create the training and test data split. During DFS, features would not be calculated after the cutoff time. This same behavior can often times be achieved more simply by splitting the data prior to creating the EntitySet, since filtering the data at feature matrix calculation is more computationally intensive than splitting the data ahead of time.\n",
    "\n",
    "```\n",
    "split_point = int(df.shape[0]*.7)\n",
    "\n",
    "training_data = df[:split_point]\n",
    "test_data = df[split_point:]\n",
    "```\n",
    "\n",
    "So, since we can't use the existing parameters for defining each observation's feature engineering window, we'll need to define new the concepts of `gap` and `window_length`. These will allow us to set a feature engineering window that exists prior to each observation.\n",
    "\n",
    "## Gap and Window Length\n",
    "\n",
    "Note that we will be using integers when defining the gap and window length. This implies that our data occurs at evenly spaced intervals--in this case daily--so a number `n` corresponds to `n` days. Support for unevenly spaced intervals is ongoing and can be explored with the Woodwork method [df.ww.infer_temporal_frequencies](https://woodwork.alteryx.com/en/stable/generated/woodwork.table_accessor.WoodworkTableAccessor.infer_temporal_frequencies.html#woodwork.table_accessor.WoodworkTableAccessor.infer_temporal_frequencies).\n",
    "\n",
    "If we are at a point in time `t`, we have access to information from times less than `t` (past values), and we do not have information from times greater than `t` (future values). Our limitations in feature engineering, then, will come from when exactly before `t` we have access to the data. \n",
    "\n",
    "Consider an example where we're recording data that takes a week to ingest; the earliest data we have access to is from seven days ago, or `t - 7`. We'll call this our `gap`. A `gap` of 0 would include the instance itself, which we must be careful to avoid in time series problems, as this exposes our target.\n",
    "\n",
    "We also need to determine how far back in time before `t - 7` we can go. Too far back, and we may lose the potency of our recent observations, but too recent, and we may not capture the full spectrum of behaviors displayed by the data. In this example, let's say that we only want to look at 5 days worth of data at a time. We'll call this our `window_length`. \n",
    "\n",
    "![Time Series Timeline](../_static/images/time_series_FE_timeline.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a90799f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "gap = 7\n",
    "window_length = 5"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "460b4c49",
   "metadata": {},
   "source": [
    "With these two parameters (`gap` and `window_length`) set, we have defined our feature engineering window. Now, we can move onto defining our feature primitives.\n",
    "\n",
    "## Time Series Primitives\n",
    "\n",
    "There are three types of primitives we'll focus on for time series problems. One of them will extract features from the time index, and the other two types will extract features from our target column. \n",
    "\n",
    "### Datetime Transform Primitives\n",
    "\n",
    "We need a way of implicating time in our time series features. Yes, using recent temperatures is incredibly predictive in determining future temperatures, but there is also a whole host of historical data suggesting that the month of the year is a pretty good indicator for the temperature outside. However, if we look at the data, we'll see that, though the day changes, the observations are always taken at the same hour, so the `Hour` primitive will not likely be useful. Of course, in a dataset that is measured at an hourly frequency or one more granular, `Hour` may be incrediby predictive. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65246092",
   "metadata": {},
   "outputs": [],
   "source": [
    "datetime_primitives = [\"Day\", \"Year\", \"Weekday\", \"Month\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95d8c86a",
   "metadata": {},
   "source": [
    "The full list of datetime transform primitives can be seen [here](https://featuretools.alteryx.com/en/latest/api_reference.html#datetime-transform-primitives).\n",
    "\n",
    "### Delaying Primitives\n",
    "\n",
    "The simplest thing we can do with our target column is to build features that are delayed (or lagging) versions of the target column. We'll make one feature per observation in our feature engineering windows, so we'll range over time from `t - gap - window_length` to `t - gap`. \n",
    "\n",
    "For this purpose, we can use our `Lag` primitive and create one primitive for each instance in our window. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9e1fa8f",
   "metadata": {},
   "outputs": [],
   "source": [
    "delaying_primitives = [Lag(periods=i + gap) for i in range(window_length)]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03cd4474",
   "metadata": {},
   "source": [
    "### Rolling Transform Primitives\n",
    "\n",
    "Since we have access to the entire feature engineering window, we can aggregate over that window. Featuretools has several rolling primitives with which we can achieve this. Here, we'll use the `RollingMean` and `RollingMin` primitives, setting the `gap` and `window_length` accordingly. The gap is especially important: when the gap is zero, the current observation's target value is present in the window, which exposes our target.\n",
    "\n",
    "This concern also exists for other primitives that reference earlier values in the dataframe. Because of this, when using primitives for time series feature engineering, take care not to apply primitives to the target column that incorporate the current observation when calculating a feature value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed6cc722",
   "metadata": {},
   "outputs": [],
   "source": [
    "rolling_mean_primitive = RollingMean(\n",
    "    window_length=window_length, gap=gap, min_periods=window_length\n",
    ")\n",
    "\n",
    "rolling_min_primitive = RollingMin(\n",
    "    window_length=window_length, gap=gap, min_periods=window_length\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1eb2a6e1",
   "metadata": {},
   "source": [
    "The full list of rolling transform primitives can be seen [here](https://featuretools.alteryx.com/en/latest/api_reference.html#rolling-transform-primitives).\n",
    "\n",
    "## Run DFS\n",
    "\n",
    "Now that we've defined our time series primitives, we can pass them into DFS and get our feature matrix!\n",
    "\n",
    "Let's take a look at an actual feature engineering window as we defined with `gap` and `window_length` above. Below is an example of how we can extract many features using the same feature engineering window without exposing our target value.\n",
    "\n",
    "![FE Window](../_static/images/window_calculations.png)\n",
    "\n",
    "With the image above, we see how all of our defined primitives get used to create many features from just the two columns we have access to."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42f52b73",
   "metadata": {},
   "outputs": [],
   "source": [
    "fm, f = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"temperatures\",\n",
    "    trans_primitives=(\n",
    "        datetime_primitives\n",
    "        + delaying_primitives\n",
    "        + [rolling_mean_primitive, rolling_min_primitive]\n",
    "    ),\n",
    "    cutoff_time=pd.Timestamp(\"1987-1-30\"),\n",
    ")\n",
    "\n",
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e8ce29d",
   "metadata": {},
   "outputs": [],
   "source": [
    "fm.iloc[:, [0, 2, 6, 7, 8, 9]].head(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b984ff57",
   "metadata": {},
   "source": [
    "Above is our time series feature matrix! The rolling and delayed features are built from our target column, but do not expose it. We can now use the feature matrix to create a machine learning model that predicts future minimum daily temperatures."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/guides/tuning_dfs.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a4329c7d",
   "metadata": {},
   "source": [
    "# Tuning Deep Feature Synthesis\n",
    "\n",
    "There are several parameters that can be tuned to change the output of DFS. We'll explore these parameters using the following `transactions` EntitySet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12607fd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import featuretools as ft\n",
    "\n",
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ef15160",
   "metadata": {},
   "source": [
    "## Using \"Seed Features\"\n",
    "\n",
    "Seed features are manually defined, problem-specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these features when it can.\n",
    "\n",
    "By using seed features, we can include domain-specific knowledge in feature engineering automation. For the seed feature below, the domain knowledge may be that, for a specific retailer, a transaction above $125 would be considered an expensive purchase."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b35f388e",
   "metadata": {},
   "outputs": [],
   "source": [
    "expensive_purchase = ft.Feature(es[\"transactions\"].ww[\"amount\"]) > 125\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"percent_true\"],\n",
    "    seed_features=[expensive_purchase],\n",
    ")\n",
    "feature_matrix[[\"PERCENT_TRUE(transactions.amount > 125)\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8703d4b3",
   "metadata": {},
   "source": [
    "We can now see that the ``PERCENT_TRUE`` primitive was automatically applied to the boolean `expensive_purchase` feature from the `transactions` table. The feature produced as a result can be understood as the percentage of transactions for a customer that are considered expensive.\n",
    "\n",
    "## Add \"interesting\" values to columns\n",
    "\n",
    "Sometimes we want to create features that are conditioned on a second value before calculations are performed. We call this extra filter a \"where clause\". Where clauses are used in Deep Feature Synthesis by including primitives in the `where_primitives` parameter to DFS.\n",
    "\n",
    "By default, where clauses are built using the ``interesting_values`` of a column.\n",
    "\n",
    "Interesting values can be automatically determined and added for each DataFrame in a pandas EntitySet by calling `es.add_interesting_values()`. They can also be set explicitly for specific columns by passing a `values` dictionary, as in the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6e88923",
   "metadata": {},
   "outputs": [],
   "source": [
    "values_dict = {\"device\": [\"desktop\", \"mobile\", \"tablet\"]}\n",
    "es.add_interesting_values(dataframe_name=\"sessions\", values=values_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "beee9073",
   "metadata": {},
   "source": [
    "Interesting values are stored in the DataFrame's Woodwork typing information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c70ff02e",
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"sessions\"].ww.columns[\"device\"].metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ddec8e5a",
   "metadata": {},
   "source": [
    "Now that interesting values are set for the `device` column in the `sessions` table, we can specify the aggregation primitives for which we want where clauses using the ``where_primitives`` parameter to DFS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6eaabad8",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"count\", \"avg_time_between\"],\n",
    "    where_primitives=[\"count\", \"avg_time_between\"],\n",
    "    trans_primitives=[],\n",
    ")\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "681a19db",
   "metadata": {},
   "source": [
    "Now, we have several new potentially useful features. Here are two of them that are built off of the where clause \"where the device used was a tablet\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31a2a94e",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix[\n",
    "    [\n",
    "        \"COUNT(sessions WHERE device = tablet)\",\n",
    "        \"AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)\",\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b43a4a5",
   "metadata": {},
   "source": [
    "The first feature, `COUNT(sessions WHERE device = tablet)`, can be understood as indicating *how many sessions a customer completed on a tablet*.\n",
    "\n",
    "The second feature, `AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)`, calculates *the average time between those sessions*.\n",
    "\n",
    "We can see that customers who had only 0 or 1 sessions on a tablet have ``NaN`` values for the average time between such sessions.\n",
    "\n",
    "\n",
    "## Encoding categorical features\n",
    "\n",
    "Machine learning algorithms typically expect all numeric data or data that has defined numeric representations, like boolean values corresponding to `0` and `1`. When Deep Feature Synthesis generates categorical features, we can encode them using Featuretools."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2ccb27b",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"mode\"],\n",
    "    trans_primitives=[\"time_since\"],\n",
    "    max_depth=1,\n",
    ")\n",
    "\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a50adb54",
   "metadata": {},
   "source": [
    "This feature matrix contains 2 columns that are categorical in nature, ``zip_code`` and ``MODE(sessions.device)``. We can use the feature matrix and feature definitions to encode these categorical values into boolean values. Featuretools offers functionality to apply one hot encoding to the output of DFS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "088672ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)\n",
    "feature_matrix_enc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54076098",
   "metadata": {},
   "source": [
    "The returned feature matrix is now encoded in a way that is interpretable to machine learning algorithms. Notice how the columns that did not need encoding are still included. Additionally, we get a new set of feature definitions that contain the encoded values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db8dd84b",
   "metadata": {},
   "outputs": [],
   "source": [
    "features_enc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4bda3a2",
   "metadata": {},
   "source": [
    "These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read the [Deployment](deployment.ipynb) guide."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/index.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "raw",
   "id": "25bd9564",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _quick-start:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4746904c",
   "metadata": {},
   "source": [
    "# What is Featuretools?\n",
    "<img src=\"_static/images/featuretools_nav2.svg\" width=\"500\" align=\"center\" alt=\"Featuretools\">\n",
    "\n",
    "\n",
    "**Featuretools** is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.\n",
    "\n",
    "\n",
    "## 5 Minute Quick Start\n",
    "\n",
    "Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ed1924f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import featuretools as ft"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bc51d89",
   "metadata": {},
   "source": [
    "#### Load Mock Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be39a49a",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = ft.demo.load_mock_customer()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb2552f2",
   "metadata": {},
   "source": [
    "#### Prepare data\n",
    "\n",
    "In this toy dataset, there are 3 DataFrames.\n",
    "\n",
    "- **customers**: unique customers who had sessions\n",
    "- **sessions**: unique sessions and associated attributes\n",
    "- **transactions**: list of events in this session\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9bb55d86",
   "metadata": {},
   "outputs": [],
   "source": [
    "customers_df = data[\"customers\"]\n",
    "customers_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2054eb2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "sessions_df = data[\"sessions\"]\n",
    "sessions_df.sample(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "348e7614",
   "metadata": {},
   "outputs": [],
   "source": [
    "transactions_df = data[\"transactions\"]\n",
    "transactions_df.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59fc2126",
   "metadata": {},
   "source": [
    "First, we specify a dictionary with all the DataFrames in our dataset. The DataFrames are passed in with their index column and time index column if one exists for the DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3fdc96a",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataframes = {\n",
    "    \"customers\": (customers_df, \"customer_id\"),\n",
    "    \"sessions\": (sessions_df, \"session_id\", \"session_start\"),\n",
    "    \"transactions\": (transactions_df, \"transaction_id\", \"transaction_time\"),\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0d84890",
   "metadata": {},
   "source": [
    "Second, we specify how the DataFrames are related. When two DataFrames have a one-to-many relationship, we call the \"one\" DataFrame the \"parent DataFrame\" and the \"many\" DataFrame the \"child DataFrame\". A relationship between a parent and child is defined like this:\n",
    "    \n",
    "    (parent_dataframe, parent_column, child_dataframe, child_column)\n",
    "\n",
    "In this dataset we have two relationships:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc4366dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "relationships = [\n",
    "    (\"sessions\", \"session_id\", \"transactions\", \"session_id\"),\n",
    "    (\"customers\", \"customer_id\", \"sessions\", \"customer_id\"),\n",
    "]"
   ]
  },
  {
   "cell_type": "raw",
   "id": "758f8fd4",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note::\n",
    "\n",
    "    To manage setting up DataFrames and relationships, we recommend using the :class:`EntitySet <featuretools.EntitySet>` class which offers convenient APIs for managing data like this. See :doc:`getting_started/using_entitysets` for more information."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "330d66b0",
   "metadata": {},
   "source": [
    "#### Run Deep Feature Synthesis\n",
    "\n",
    "A minimal input to DFS is a dictionary of DataFrames, a list of relationships, and the name of the target DataFrame whose features we want to calculate. The output of DFS is a feature matrix and the corresponding list of feature definitions.\n",
    "\n",
    "Let's first create a feature matrix for each customer in the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13cae382",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix_customers, features_defs = ft.dfs(\n",
    "    dataframes=dataframes,\n",
    "    relationships=relationships,\n",
    "    target_dataframe_name=\"customers\",\n",
    ")\n",
    "feature_matrix_customers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71628a1c",
   "metadata": {},
   "source": [
    "We now have dozens of new features to describe a customer's behavior.\n",
    "\n",
    "#### Change target DataFrame\n",
    "One of the reasons DFS is so powerful is that it can create a feature matrix for *any* DataFrame in our EntitySet. For example, we can build features for sessions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4cfe1aca",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "dataframes = {\n",
    "    \"customers\": (customers_df.copy(), \"customer_id\"),\n",
    "    \"sessions\": (sessions_df.copy(), \"session_id\", \"session_start\"),\n",
    "    \"transactions\": (transactions_df.copy(), \"transaction_id\", \"transaction_time\"),\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84fec203",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix_sessions, features_defs = ft.dfs(\n",
    "    dataframes=dataframes, relationships=relationships, target_dataframe_name=\"sessions\"\n",
    ")\n",
    "feature_matrix_sessions.head(5)"
   ]
  },
  {
   "cell_type": "raw",
   "id": "a67d574e",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Understanding Feature Output\n",
    "~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n",
    "\n",
    "In general, Featuretools references generated features through the feature name. In order to make features easier to understand, Featuretools offers two additional tools, :func:`featuretools.graph_feature` and :func:`featuretools.describe_feature`, to help explain what a feature is and the steps Featuretools took to generate it. Let's look at this example feature:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c791dda",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature = features_defs[18]\n",
    "feature"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84b5be0f",
   "metadata": {},
   "source": [
    "##### Feature lineage graphs\n",
    "\n",
    "Feature lineage graphs visually walk through feature generation. Starting from the base data, they show step by step the primitives applied and intermediate features generated to create the final feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0cd93f3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.graph_feature(feature)"
   ]
  },
  {
   "cell_type": "raw",
   "id": "d6e5e0a1",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. graphviz:: getting_started/graphs/demo_feat.dot\n",
    "\n",
    "Feature descriptions\n",
    "\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\n",
    "\n",
    "Featuretools can also automatically generate English sentence descriptions of features. Feature descriptions help to explain what a feature is, and can be further improved by including manually defined custom definitions. See :doc:`/guides/feature_descriptions` for more details on how to customize automatically generated feature descriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3bdbe1c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "ft.describe_feature(feature)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44635e1f",
   "metadata": {},
   "source": [
    "## What's next?\n",
    "\n",
    "\n",
    "* Learn about [Representing Data with EntitySets](getting_started/using_entitysets.ipynb)\n",
    "* Apply automated feature engineering with [Deep Feature Synthesis](getting_started/afe.ipynb)\n",
    "* Explore [runnable demos](https://www.featuretools.com/demos) based on real world use cases\n",
    "* Can't find what you're looking for? Ask for [help](resources/help.rst)"
   ]
  },
  {
   "cell_type": "raw",
   "id": "cb2d443c",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Table of contents\n",
    "-----------------\n",
    "\n",
    ".. toctree::\n",
    "   :maxdepth: 1\n",
    "\n",
    "   install\n",
    "\n",
    ".. toctree::\n",
    "   :maxdepth: 2\n",
    "\n",
    "   getting_started/getting_started_index\n",
    "   guides/guides_index\n",
    "\n",
    ".. toctree::\n",
    "   :maxdepth: 1\n",
    "   :caption: Resources and References\n",
    "\n",
    "   resources/resources_index\n",
    "   api_reference\n",
    "   release_notes\n",
    "\n",
    "Other links\n",
    "------------\n",
    "* :ref:`genindex`\n",
    "* :ref:`search`\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/install.md
================================================
# Install

Featuretools is available for Python 3.9 - 3.12. It can be installed from [PyPI](https://pypi.org/project/featuretools/), [conda-forge](https://anaconda.org/conda-forge/featuretools), or from [source](https://github.com/alteryx/featuretools).

To install Featuretools, run the following command:

````{tab} PyPI
```console
$ python -m pip install featuretools
```
````

````{tab} Conda
```console
$ conda install -c conda-forge featuretools
```
````
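
After installing, it can be useful to confirm that the interpreter falls within the supported range and that the package is visible to it. The following is a minimal sketch using only the standard library. The helper names `python_is_supported` and `installed_version` are illustrative and are not part of Featuretools.

```python
import sys
from importlib import metadata

# Supported range taken from the statement above (Python 3.9 - 3.12).
MIN_PYTHON = (3, 9)
MAX_PYTHON = (3, 12)


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair is within the supported range."""
    return MIN_PYTHON <= tuple(version_info[:2]) <= MAX_PYTHON


def installed_version(distribution="featuretools"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("featuretools version:", installed_version() or "not installed")
```

Reading the version from the installed distribution metadata avoids importing the package itself, so the check still reports something useful when the import would fail.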

## Add-ons

Featuretools allows users to install add-ons individually or all at once:

````{tab} PyPI
```{tab} All Add-ons
```console
$ python -m pip install "featuretools[complete]"
```
```{tab} Dask
```console
$ python -m pip install "featuretools[dask]"
```
```{tab} NLP Primitives
```console
$ python -m pip install "featuretools[nlp]"
```
```{tab} Premium Primitives
```console
$ python -m pip install "featuretools[premium]"
```

````
````{tab} Conda
```{tab} All Add-ons
```console
$ conda install -c conda-forge nlp-primitives dask distributed
```
```{tab} NLP Primitives
```console
$ conda install -c conda-forge nlp-primitives
```
```{tab} Dask
```console
$ conda install -c conda-forge dask distributed
```
````

- **NLP Primitives**: Use Natural Language Processing Primitives in Featuretools
- **Premium Primitives**: Use primitives from Premium Primitives in Featuretools
- **Dask**: Use to run `calculate_feature_matrix` in parallel with `n_jobs`
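
To see which add-ons are importable in the current environment, something like the sketch below may help. The import names are assumptions: `dask` and `distributed` for the Dask add-on and `nlp_primitives` for the NLP Primitives add-on. The helper `addon_status` is hypothetical and not a Featuretools function.

```python
from importlib.util import find_spec

# Import names are assumptions: "dask" and "distributed" for the Dask add-on,
# "nlp_primitives" for the NLP Primitives add-on.
ADDON_MODULES = {
    "Dask": ["dask", "distributed"],
    "NLP Primitives": ["nlp_primitives"],
}


def addon_status(addons=ADDON_MODULES):
    """Map each add-on name to True if all of its modules can be found."""
    return {
        name: all(find_spec(module) is not None for module in modules)
        for name, modules in addons.items()
    }


if __name__ == "__main__":
    for name, available in addon_status().items():
        print(f"{name}: {'installed' if available else 'missing'}")
```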

## Installing Graphviz

In order to use `EntitySet.plot` or `featuretools.graph_feature` you will need to install the graphviz library.

````{tab} macOS (Intel, M1)
:new-set:
```{tab} pip
```console
$ brew install graphviz
$ python -m pip install graphviz
```
```{tab} conda
```console
$ brew install graphviz
$ conda install -c conda-forge python-graphviz
```
````

````{tab} Ubuntu
```{tab} pip
```console
$ sudo apt install graphviz
$ python -m pip install graphviz
```
```{tab} conda
```console
$ sudo apt install graphviz
$ conda install -c conda-forge python-graphviz
```
````

````{tab} Windows
```{tab} pip
```console
$ python -m pip install graphviz
```
```{tab} conda
```console
$ conda install -c conda-forge python-graphviz
```
````

If you installed graphviz for **Windows** with `pip`, install graphviz.exe from the [official source](https://graphviz.org/download/#windows).
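
Plotting typically needs both pieces: the Graphviz system binaries and the Python bindings. As a rough check, assuming the `dot` executable is the one the Python package looks for on `PATH`, the sketch below reports whether each is visible. The function name `graphviz_status` is illustrative only.

```python
import shutil
from importlib.util import find_spec


def graphviz_status():
    """Report whether the Graphviz binaries and Python bindings are visible."""
    return {
        # "dot" is the layout executable shipped with the Graphviz system package.
        "dot_executable": shutil.which("dot") is not None,
        # "graphviz" is the import name of the Python package installed above.
        "python_package": find_spec("graphviz") is not None,
    }


if __name__ == "__main__":
    for component, found in graphviz_status().items():
        print(f"{component}: {'found' if found else 'not found'}")
```

If `python_package` is found but `dot_executable` is not, the system-level install step is the likely missing piece.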

## Source

To install Featuretools from source, clone the repository from [GitHub](https://github.com/alteryx/featuretools), and install the dependencies.

```bash
git clone https://github.com/alteryx/featuretools.git
cd featuretools
python -m pip install .
```

## Docker

It is also possible to run Featuretools inside a Docker container.
You can do so by installing it as a package inside a container (following the normal install guide) or
creating a new image with Featuretools pre-installed, using the following commands in your `Dockerfile`:

```dockerfile
FROM --platform=linux/x86_64 python:3.9-slim-buster
RUN apt update && apt -y update
RUN apt install -y build-essential
RUN pip3 install --upgrade --quiet pip
RUN pip3 install featuretools
```

# Development

To make contributions to the codebase, please follow the guidelines [here](https://github.com/alteryx/featuretools/blob/main/contributing.md).


================================================
FILE: docs/source/release_notes.rst
================================================
.. _release_notes:

Release Notes
-------------

Future Release
==============
    * Enhancements
    * Fixes
    * Changes
        * Restrict numpy to <2.0.0 (:pr:`2743`)
    * Documentation Changes
        * Update API Docs to include previously missing primitives (:pr:`2737`)
    * Testing Changes

    Thanks to the following people for contributing to this release:
    :user:`thehomebrewnerd`

v1.31.0 May 14, 2024
====================
    * Enhancements
        * Add support for Python 3.12 (:pr:`2713`)
    * Fixes
        * Move ``flatten_list`` util function into ``feature_discovery`` module to fix import bug (:pr:`2702`)
    * Changes
        * Temporarily restrict Dask version (:pr:`2694`)
        * Remove support for creating ``EntitySets`` from Dask or Pyspark dataframes (:pr:`2705`)
        * Bump minimum versions of ``tqdm`` and ``pip`` in requirements files (:pr:`2716`)
        * Use ``filter`` arg in call to ``tarfile.extractall`` to safely deserialize EntitySets (:pr:`2722`)
    * Testing Changes
        * Fix serialization test to work with pytest 8.1.1 (:pr:`2694`)
        * Update to allow minimum dependency checker to run properly (:pr:`2709`)
        * Update pull request check CI action (:pr:`2720`)
        * Update release notes updated check CI action (:pr:`2726`)

    Thanks to the following people for contributing to this release:
    :user:`thehomebrewnerd`

Breaking Changes
++++++++++++++++
* With this release of Featuretools, EntitySets can no longer be created from Dask or Pyspark dataframes. The behavior when using pandas
  dataframes to create EntitySets remains unchanged.


v1.30.0 Feb 26, 2024
====================
    * Changes
        * Update min requirements for numpy, pandas and Woodwork (:pr:`2681`)
        * Update release notes version for release (:pr:`2689`)
    * Testing Changes
        * Update ``make_ecommerce_entityset`` to work without Dask (:pr:`2677`)

    Thanks to the following people for contributing to this release:
    :user:`tamargrey`, :user:`thehomebrewnerd`

v1.29.0 Feb 16, 2024
====================
    .. warning::
        This release of Featuretools will not support Python 3.8

    * Fixes
        * Fix dependency issues (:pr:`2644`, :pr:`2656`)
        * Add workaround for pandas 2.2.0 bug with nunique and unpin pandas deps (:pr:`2657`)
    * Changes
        * Fix deprecation warnings with is_categorical_dtype (:pr:`2641`)
        * Remove woodwork, pyarrow, numpy, and pandas pins for spark installation (:pr:`2661`)
    * Documentation Changes
        * Update Featuretools logo to display properly in dark mode (:pr:`2632`)
        * Remove references to premium primitives while release isn't possible (:pr:`2674`)
    * Testing Changes
        * Update tests for compatibility with new versions of ``holidays`` (:pr:`2636`)
        * Update ruff to 0.1.6 and use ruff linter/formatter (:pr:`2639`)
        * Update ``release.yaml`` to use trusted publisher for PyPI releases (:pr:`2646`, :pr:`2653`, :pr:`2654`)
        * Update dependency checkers and tests to include Dask (:pr:`2658`)
        * Fix the tests that run with Woodwork main so they can be triggered (:pr:`2657`)
        * Fix minimum dependency checker action (:pr:`2664`)
        * Fix Slack alert for tests with Woodwork main branch (:pr:`2668`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`thehomebrewnerd`, :user:`tamargrey`, :user:`LakshmanKishore`


v1.28.0 Oct 26, 2023
====================
    * Fixes
        * Fix bug with default value in ``PercentTrue`` primitive (:pr:`2627`)
    * Changes
        * Refactor ``featuretools/tests/primitive_tests/utils.py`` to leverage list comprehensions for improved Pythonic quality (:pr:`2607`)
        * Refactor ``can_stack_primitive_on_inputs`` (:pr:`2522`)
        * Update s3 bucket for docs image (:pr:`2593`)
        * Temporarily restrict pandas max version to ``<2.1.0`` and pyarrow to ``<13.0.0`` (:pr:`2609`)
        * Update for compatibility with pandas version ``2.1.0`` and remove pandas upper version restriction (:pr:`2616`)
    * Documentation Changes
        * Fix badge on README for tests (:pr:`2598`)
        * Update readthedocs config to use build.os (:pr:`2601`)
    * Testing Changes
        * Update airflow looking glass performance tests workflow (:pr:`2615`)
        * Removed old performance testing workflow (:pr:`2620`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`petejanuszewski1`, :user:`thehomebrewnerd`, :user:`tosemml`

v1.27.0 Jul 24, 2023
====================
    * Enhancements
        * Add support for Python 3.11 (:pr:`2583`)
        * Add support for ``pandas`` v2.0 (:pr:`2585`)
    * Changes
        * Remove natural language primitives add-on (:pr:`2570`)
        * Updates to address various warnings (:pr:`2589`)
    * Testing Changes
        * Run looking glass performance tests on merge via Airflow (:pr:`2575`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`petejanuszewski1`, :user:`sbadithe`, :user:`thehomebrewnerd`

v1.26.0 Apr 27, 2023
====================
    * Enhancements
        * Introduce New Single-Table DFS Algorithm (:pr:`2516`). This includes **experimental** functionality and is not officially supported.
        * Add premium primitives install command (:pr:`2545`)
    * Fixes
        * Fix Description of ``DaysInMonth`` (:pr:`2547`)
    * Changes
        * Make Dask an optional dependency (:pr:`2560`)

    Thanks to the following people for contributing to this release:
    :user:`dvreed77`, :user:`gsheni`, :user:`thehomebrewnerd`

Breaking Changes
++++++++++++++++
* Dask is now an optional dependency of Featuretools. Users who run ``calculate_feature_matrix`` with ``n_jobs`` set
  to anything other than 1 will now need to install Dask prior to running ``calculate_feature_matrix``. The required Dask
  dependencies can be installed with ``pip install "featuretools[dask]"``.

v1.25.0 Apr 13, 2023
====================
    * Enhancements
        * Add ``MaxCount``, ``MedianCount``, ``MaxMinDelta``, ``NUniqueDays``, ``NMostCommonFrequency``,
            ``NUniqueDaysOfCalendarYear``, ``NUniqueDaysOfMonth``, ``NUniqueMonths``,
            ``NUniqueWeeks``, ``IsFirstWeekOfMonth`` (:pr:`2533`)
        * Add ``HasNoDuplicates``, ``NthWeekOfMonth``, ``IsMonotonicallyDecreasing``, ``IsMonotonicallyIncreasing``,
            ``IsUnique`` (:pr:`2537`)
    * Fixes
        * Fix release notes header version (:pr:`2544`)
    * Changes
        * Restrict pandas to < 2.0.0 (:pr:`2533`)
        * Upgrade minimum pandas to 1.5.0 (:pr:`2537`)
        * Removed the ``Correlation`` and ``AutoCorrelation`` primitives as these could lead to data leakage (:pr:`2537`)
        * Remove IntegerNullable support for ``Kurtosis`` primitive  (:pr:`2537`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`

v1.24.0 Mar 28, 2023
====================
    * Enhancements
        * Add ``AverageCountPerUnique``, ``CountryCodeToContinent``, ``FileExtension``, ``FirstLastTimeDelta``, ``SavgolFilter``,
            ``CumulativeTimeSinceLastFalse``, ``CumulativeTimeSinceLastTrue``, ``PercentChange``, ``PercentUnique`` (:pr:`2485`)
        * Add ``FullNameToFirstName``, ``FullNameToLastName``, ``FullNameToTitle``, ``AutoCorrelation``,
            ``Correlation``, ``DateFirstEvent`` (:pr:`2507`)
        * Add ``Kurtosis``, ``MinCount``, ``NumFalseSinceLastTrue``, ``NumPeaks``,
            ``NumTrueSinceLastFalse``, ``NumZeroCrossings`` (:pr:`2514`)
    * Fixes
        * Pin github-action-check-linked-issues to 1.4.5 (:pr:`2497`)
        * Support Woodwork's updated numeric inference (integers as strings) (:pr:`2505`)
        * Update ``SubtractNumeric`` Primitive with commutative class property (:pr:`2527`)
    * Changes
        * Separate Makefile command for core requirements, test requirements and dev requirements (:pr:`2518`)

    Thanks to the following people for contributing to this release:
    :user:`dvreed77`, :user:`gsheni`, :user:`ozzieD`

v1.23.0 Feb 15, 2023
====================
    * Changes
        * Change ``TotalWordLength`` and ``UpperCaseWordCount`` to return ``IntegerNullable`` (:pr:`2474`)
    * Testing Changes
        * Add GitHub Actions cache to speed up workflows (:pr:`2475`)
        * Fix latest dependency checker install command (:pr:`2476`)
        * Add pull request check for linked issues to CI workflow (:pr:`2477`, :pr:`2481`)
        * Remove make package from lint workflow (:pr:`2479`)

    Thanks to the following people for contributing to this release:
    :user:`dvreed77`, :user:`gsheni`, :user:`sbadithe`

v1.22.0 Jan 31, 2023
====================
    * Enhancements
        * Add ``AbsoluteDiff``, ``SameAsPrevious``, ``Variance``, ``Season``, ``UpperCaseWordCount`` transform primitives (:pr:`2460`)
    * Fixes
        * Fix bug with consecutive spaces in ``NumWords`` (:pr:`2459`)
        * Fix for compatibility with ``holidays`` v0.19.0 (:pr:`2471`)
    * Changes
        * Specify black and ruff config arguments in pre-commit-config (:pr:`2456`)
        * ``NumCharacters`` returns null given null input (:pr:`2463`)
    * Documentation Changes
        * Update ``release.md`` with instructions for launching Looking Glass performance test runs (:pr:`2461`)
        * Pin ``jupyter-client==7.4.9`` to fix broken documentation build (:pr:`2463`)
        * Unpin jupyter-client documentation requirement (:pr:`2468`)
    * Testing Changes
        * Add test suites for ``NumWords`` and ``NumCharacters`` primitives (:pr:`2459`, :pr:`2463`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`

v1.21.0 Jan 18, 2023
====================
    * Enhancements
        * Add ``get_recommended_primitives`` function to featuretools (:pr:`2398`)
    * Changes
        * Update build_docs workflow to only run for Python 3.8 and Python 3.10 (:pr:`2447`)
    * Documentation Changes
        * Minor fix to release notes (:pr:`2444`)
    * Testing Changes
        * Add test that checks for Natural Language primitives timing out against edge-case input (:pr:`2429`)
        * Fix test compatibility with composeml 0.10 (:pr:`2439`)
        * Minimum dependency unit test jobs do not abort if one job fails (:pr:`2437`)
        * Run Looking Glass performance tests on merge to main (:pr:`2440`, :pr:`2441`)
        * Add ruff for linting and replace isort/flake8 (:pr:`2448`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`

v1.20.0 Jan 5, 2023
===================
    * Enhancements
        * Add ``TimeSinceLastFalse``, ``TimeSinceLastMax``, ``TimeSinceLastMin``, and ``TimeSinceLastTrue`` primitives (:pr:`2418`)
        * Add ``MaxConsecutiveFalse``, ``MaxConsecutiveNegatives``, ``MaxConsecutivePositives``, ``MaxConsecutiveTrue``, ``MaxConsecutiveZeros``, ``NumConsecutiveGreaterMean``, ``NumConsecutiveLessMean`` (:pr:`2420`)
    * Fixes
        * Fix typo in ``_handle_binary_comparison`` function name and update ``set_feature_names`` docstring (:pr:`2388`)
        * Only allow Datetime time index as input to ``RateOfChange`` primitive (:pr:`2408`)
        * Prevent catastrophic backtracking in regex for ``NumberOfWordsInQuotes`` (:pr:`2413`)
        * Fix to eliminate fragmentation ``PerformanceWarning`` in ``feature_set_calculator.py`` (:pr:`2424`)
        * Fix serialization of ``NumberOfCommonWords`` feature with custom word_set (:pr:`2432`)
        * Improve edge case handling in NaturalLanguage primitives by standardizing delimiter regex (:pr:`2423`)
        * Remove support for ``Datetime`` and ``Ordinal`` inputs in several primitives to prevent creation of Features that cannot be calculated (:pr:`2434`)
    * Changes
        * Refactor ``_all_direct_and_same_path`` by deleting call to ``_features_have_same_path`` (:pr:`2400`)
        * Refactor ``_build_transform_features`` by iterating over ``input_features`` once (:pr:`2400`)
        * Iterate only once over ``ignore_columns`` in ``DeepFeatureSynthesis`` init (:pr:`2397`)
        * Resolve empty Pandas series warnings (:pr:`2403`)
        * Initialize Woodwork with ``init_with_partial_schema`` instead of ``init`` in ``EntitySet.add_last_time_indexes`` (:pr:`2409`)
        * Updates for compatibility with numpy 1.24.0 (:pr:`2414`)
        * The ``delimiter_regex`` parameter for ``TotalWordLength`` has been renamed to ``do_not_count`` (:pr:`2423`)
    * Documentation Changes
        * Remove unused sections from 1.19.0 notes (:pr:`2396`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`


Breaking Changes
++++++++++++++++
* The ``delimiter_regex`` parameter for ``TotalWordLength`` has been renamed to ``do_not_count``.
  Old saved features that had a non-default value for the parameter will no longer load.
* Support for ``Datetime`` and ``Ordinal`` inputs has been removed from the ``LessThanScalar``,
  ``GreaterThanScalar``, ``LessThanEqualToScalar`` and ``GreaterThanEqualToScalar`` primitives.

v1.19.0 Dec 9, 2022
===================
    * Enhancements
        * Add ``OneDigitPostalCode`` and ``TwoDigitPostalCode`` primitives (:pr:`2365`)
        * Add ``ExpandingCount``, ``ExpandingMin``, ``ExpandingMean``, ``ExpandingMax``, ``ExpandingSTD``, and ``ExpandingTrend`` primitives (:pr:`2343`)
    * Fixes
        * Fix DeepFeatureSynthesis to consider the ``base_of_exclude`` family of attributes when creating transform features (:pr:`2380`)
        * Fix bug with negative version numbers in ``test_version`` (:pr:`2389`)
        * Fix bug in ``MultiplyNumericBoolean`` primitive that can cause an error with certain input dtype combinations (:pr:`2393`)
    * Testing Changes
        * Fix version comparison in ``test_holiday_out_of_range`` (:pr:`2382`)

    Thanks to the following people for contributing to this release:
    :user:`sbadithe`, :user:`thehomebrewnerd`

v1.18.0 Nov 15, 2022
====================
    * Enhancements
        * Add ``RollingOutlierCount`` primitive (:pr:`2129`)
        * Add ``RateOfChange`` primitive (:pr:`2359`)
    * Fixes
        * Sets ``uses_full_dataframe`` for ``Rolling*`` and ``Exponential*`` primitives (:pr:`2354`)
        * Updates for compatibility with upcoming Woodwork release 0.21.0 (:pr:`2363`)
        * Updates demo dataset location to use new links (:pr:`2366`)
        * Fix ``test_holiday_out_of_range`` after ``holidays`` release 0.17 (:pr:`2373`)
    * Changes
        * Remove click and CLI functions (``list-primitives``, ``info``) (:pr:`2353`, :pr:`2358`)
    * Documentation Changes
        * Build docs in parallel with Sphinx (:pr:`2351`)
        * Use non-editable install to allow local docs build (:pr:`2367`)
        * Remove primitives.featurelabs.com website from documentation (:pr:`2369`)
    * Testing Changes
        * Replace use of pytest's tmpdir fixture with tmp_path (:pr:`2344`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`tamargrey`, :user:`thehomebrewnerd`

Breaking Changes
++++++++++++++++
* The featuretools CLI has been completely removed.

v1.17.0 Oct 31, 2022
====================
    * Enhancements
        * Add featuretools-sklearn-transformer as an extra installation option (:pr:`2335`)
        * Add CountAboveMean, CountBelowMean, CountGreaterThan, CountInsideNthSTD, CountInsideRange, CountLessThan, CountOutsideNthSTD, CountOutsideRange (:pr:`2336`)
    * Changes
        * Restructure primitives directory to use individual primitives files (:pr:`2331`)
        * Restrict 2022.10.1 for dask and distributed (:pr:`2347`)
    * Documentation Changes
        * Add Featuretools-SQL to Install page on documentation (:pr:`2337`)
        * Fixes broken link in Featuretools documentation (:pr:`2339`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`

v1.16.0 Oct 24, 2022
====================
    * Enhancements
        * Add ExponentialWeighted primitives and DateToTimeZone primitive (:pr:`2318`)
        * Add 14 natural language primitives from ``nlp_primitives`` library (:pr:`2328`)
    * Documentation Changes
        * Fix typos in ``aggregation_primitive_base.py`` and ``features_deserializer.py`` (:pr:`2317`, :pr:`2324`)
        * Update SQL integration documentation to reflect Snowflake compatibility (:pr:`2313`)
    * Testing Changes
        * Add Windows install test (:pr:`2330`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`sbadithe`, :user:`thehomebrewnerd`

v1.15.0 Oct 6, 2022
===================
    * Enhancements
        * Add ``series_library`` attribute to ``EntitySet`` dictionary (:pr:`2257`)
        * Leverage ``Library`` Enum inheriting from ``str`` (:pr:`2275`)
    * Changes
        * Change default gap for Rolling* primitives from 0 to 1 to prevent accidental leakage (:pr:`2282`)
        * Updates for pandas 1.5.0 compatibility (:pr:`2290`, :pr:`2291`, :pr:`2308`)
        * Exclude documentation files from release workflow (:pr:`2295`)
        * Bump requirements for optional pyspark dependency (:pr:`2299`)
        * Bump ``scipy`` and ``woodwork[spark]`` dependencies (:pr:`2306`)
    * Documentation Changes
        * Add documentation describing how to use ``featuretools_sql`` with ``featuretools`` (:pr:`2262`)
        * Remove ``featuretools_sql`` as a docs requirement (:pr:`2302`)
        * Fix typo in ``DiffDatetime`` doctest (:pr:`2314`)
        * Fix typo in ``EntitySet`` documentation (:pr:`2315`)
    * Testing Changes
        * Remove graphviz version restrictions in Windows CI tests (:pr:`2285`)
        * Run CI tests with ``pytest -n auto`` (:pr:`2298`, :pr:`2310`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`

Breaking Changes
++++++++++++++++
* The ``EntitySet`` schema has been updated to include a ``series_library`` attribute.
* The default behavior of the ``Rolling*`` primitives has changed in this release. If any of these primitives were used
  without an explicit ``gap`` value, the feature values returned by this release will differ from those returned by
  prior releases.
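
The effect of the new default can be approximated with plain pandas. This is a simplified analogy for integer windows
over evenly spaced rows, not the library's implementation: a gap of 1 behaves roughly like shifting the series by one
row before rolling, so the current row no longer contributes to its own feature value. Passing ``gap=0`` explicitly
to a ``Rolling*`` primitive should restore the earlier behavior.

.. code-block:: python

    import pandas as pd

    values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

    # Analogy for the old default (gap=0): the current row is inside its own window.
    no_gap = values.rolling(window=3, min_periods=1).mean()

    # Analogy for the new default (gap=1): shift by one row first, so each
    # window ends at the previous row and the first row has no data at all.
    with_gap = values.shift(1).rolling(window=3, min_periods=1).mean()

    print(pd.DataFrame({"gap=0": no_gap, "gap=1": with_gap}))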

v1.14.0 Sep 1, 2022
===================
    * Enhancements
        * Replace ``NumericLag`` with ``Lag`` primitive (:pr:`2252`)
        * Refactor build_features to speed up long running DFS calls by 50% (:pr:`2224`)
    * Fixes
        * Fix compatibility issues with holidays 0.15 (:pr:`2254`)
    * Changes
        * Update release notes to make clear conda release portion (:pr:`2249`)
        * Use pyproject.toml only (move away from setup.cfg) (:pr:`2260`, :pr:`2263`, :pr:`2265`)
        * Add entry point instructions for pyproject.toml project (:pr:`2272`)
    * Documentation Changes
        * Fix to remove warning from Using Spark EntitySets Guide (:pr:`2258`)
    * Testing Changes
        * Add tests/profiling/dfs_profile.py (:pr:`2224`)
        * Add workflow to test featuretools without test dependencies (:pr:`2274`)

    Thanks to the following people for contributing to this release:
    :user:`cp2boston`, :user:`gsheni`, :user:`ozzieD`, :user:`stefaniesmith`, :user:`thehomebrewnerd`

v1.13.0 Aug 18, 2022
====================
    * Fixes
        * Allow boolean columns to be included in remove_highly_correlated_features (:pr:`2231`)
    * Changes
        * Refactor schema version checking to use ``packaging`` method (:pr:`2230`)
        * Extract duplicated logic for Rolling primitives into a general utility function (:pr:`2218`)
        * Set pandas version to >=1.4.0 (:pr:`2246`)
        * Remove workaround in ``roll_series_with_gap`` caused by pandas version < 1.4.0 (:pr:`2246`)
    * Documentation Changes
        * Add line breaks between sections of IsFederalHoliday primitive docstring (:pr:`2235`)
    * Testing Changes
        * Update create feedstock PR forked repo to use (:pr:`2223`, :pr:`2237`)
        * Update development requirements and use latest for documentation (:pr:`2225`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`ozzieD`, :user:`sbadithe`, :user:`tamargrey`

v1.12.1 Aug 4, 2022
===================
    * Fixes
        * Update ``Trend`` and ``RollingTrend`` primitives to work with ``IntegerNullable`` inputs (:pr:`2204`)
        * ``camel_and_title_to_snake`` handles snake case strings with numbers (:pr:`2220`)
        * Change ``_get_description`` to split on blank lines to avoid truncating primitive descriptions (:pr:`2219`)
    * Documentation Changes
        * Add instructions to add new users to featuretools feedstock (:pr:`2215`)
    * Testing Changes
        * Add create feedstock PR workflow (:pr:`2181`)
        * Add performance tests for python 3.9 and 3.10 (:pr:`2198`, :pr:`2208`)
        * Add test to ensure primitive docstrings use standardized verbs (:pr:`2200`)
        * Configure codecov to avoid premature PR comments (:pr:`2209`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`tamargrey`, :user:`thehomebrewnerd`

v1.12.0 Jul 19, 2022
====================
    .. warning::
        This release of Featuretools will not support Python 3.7

    * Enhancements
        * Add ``IsWorkingHours`` and ``IsLunchTime`` transform primitives (:pr:`2130`)
        * Add periods parameter to ``Diff`` and add ``DiffDatetime`` primitive (:pr:`2155`)
        * Add ``RollingTrend`` primitive (:pr:`2170`)
    * Fixes
        * Resolves Woodwork integration test failure and removes Python version check for codecov (:pr:`2182`)
    * Changes
        * Drop Python 3.7 support (:pr:`2169`, :pr:`2186`)
        * Add pre-commit hooks for linting (:pr:`2177`)
    * Documentation Changes
        * Augment single table entry in DFS to include information about passing in a dictionary for ``dataframes`` argument (:pr:`2160`)
    * Testing Changes
        * Standardize imports across test files to simplify accessing featuretools functions (:pr:`2166`)
        * Split spark tests into multiple CI jobs to speed up runtime (:pr:`2183`)

    Thanks to the following people for contributing to this release:
    :user:`dvreed77`, :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`

v1.11.1 Jul 5, 2022
===================
    * Fixes
        * Remove 24th hour from PartOfDay primitive and add 0th hour (:pr:`2167`)

    Thanks to the following people for contributing to this release:
    :user:`tamargrey`

v1.11.0 Jun 30, 2022
====================
    * Enhancements
        * Add datetime and string types as valid arguments to dfs ``cutoff_time`` (:pr:`2147`)
        * Add ``PartOfDay`` transform primitive (:pr:`2128`)
        * Add ``IsYearEnd``, ``IsYearStart`` transform primitives (:pr:`2124`)
        * Add ``Feature.set_feature_names`` method to directly set output column names for multi-output features (:pr:`2142`)
        * Include np.nan testing for ``DayOfYear`` and ``DaysInMonth`` primitives (:pr:`2146`)
        * Allow dfs kwargs to be passed into ``get_valid_primitives`` (:pr:`2157`)
    * Changes
        * Improve serialization and deserialization to reduce storage of duplicate primitive information (:pr:`2136`, :pr:`2127`, :pr:`2144`)
        * Sort core requirements and test requirements in setup cfg (:pr:`2152`)
    * Testing Changes
        * Fix pandas warning and reduce dask .apply warnings (:pr:`2145`)
        * Pin graphviz version used in windows tests (:pr:`2159`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`, :user:`tamargrey`, :user:`thehomebrewnerd`

v1.10.0 Jun 23, 2022
====================
    * Enhancements
        * Add ``DayOfYear``, ``DaysInMonth``, ``Quarter``, ``IsLeapYear``, ``IsQuarterEnd``, ``IsQuarterStart`` transform primitives (:pr:`2110`, :pr:`2117`)
        * Add ``IsMonthEnd``, ``IsMonthStart`` transform primitives (:pr:`2121`)
        * Move ``Quarter`` test cases (:pr:`2123`)
        * Add ``summarize_primitives`` function for getting metrics about available primitives (:pr:`2099`)
    * Changes
        * Changes for compatibility with numpy 1.23.0 (:pr:`2135`, :pr:`2137`)
    * Documentation Changes
        * Update contributing.md to add pandoc (:pr:`2103`, :pr:`2104`)
        * Update NLP primitives section of API reference (:pr:`2109`)
        * Fixing release notes formatting (:pr:`2139`)
    * Testing Changes
        * Latest dependency checker installs spark dependencies (:pr:`2112`)
        * Fix test failures with pyspark v3.3.0 (:pr:`2114`, :pr:`2120`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`

v1.9.2 Jun 10, 2022
===================
    * Fixes
        * Add feature origin information to all multi-output feature columns (:pr:`2102`)
    * Documentation Changes
        * Update contributing.md to add pandoc (:pr:`2103`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`thehomebrewnerd`

v1.9.1 May 27, 2022
===================
    * Enhancements
        * Update ``DateToHoliday`` and ``DistanceToHoliday`` primitives to work with timezone-aware inputs (:pr:`2056`)
    * Changes
        * Delete setup.py, MANIFEST.in and move configuration to pyproject.toml (:pr:`2046`)
    * Documentation Changes
        * Update slack invite link to new (:pr:`2044`)
        * Add slack and stackoverflow icon to footer (:pr:`2087`)
        * Update dead links in docs and docstrings (:pr:`2092`, :pr:`2095`)
    * Testing Changes
        * Skip test for ``normalize_dataframe`` due to different error coming from Woodwork in 0.16.3 (:pr:`2052`)
        * Fix Woodwork install in test with Woodwork main branch (:pr:`2055`)
        * Use codecov action v3 (:pr:`2039`)
        * Add workflow to kickoff EvalML unit tests with Featuretools main (:pr:`2072`)
        * Rename yml to yaml for GitHub Actions workflows (:pr:`2073`, :pr:`2077`)
        * Update Dask test fixtures to prevent flaky behavior (:pr:`2079`)
        * Update Makefile with better pkg command (:pr:`2081`)
        * Add scheduled workflow that checks for broken links in documentation (:pr:`2084`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`

v1.9.0 Apr 27, 2022
===================
    * Enhancements
        * Improve ``UnusedPrimitiveWarning`` with additional information (:pr:`2003`)
        * Update DFS primitive matching to use all inputs defined in primitive ``input_types`` (:pr:`2019`)
        * Add ``MultiplyNumericBoolean`` primitive (:pr:`2035`)
    * Fixes
        * Fix issue with Ordinal inputs to binary comparison primitives (:pr:`2024`, :pr:`2025`)
    * Changes
        * Updated autonormalize version requirement (:pr:`2002`)
        * Remove extra NaN checking in LatLong primitives (:pr:`1924`)
        * Normalize LatLong NaN values during EntitySet creation (:pr:`1924`)
        * Pass primitive dictionaries into ``check_primitive`` to avoid repetitive calls (:pr:`2016`)
        * Remove ``Boolean`` and ``BooleanNullable`` from ``MultiplyNumeric`` primitive inputs (:pr:`2022`)
        * Update serialization for compatibility with Woodwork version 0.16.1 (:pr:`2030`)
    * Documentation Changes
        * Update README text to Alteryx (:pr:`2010`, :pr:`2015`)
    * Testing Changes
        * Update unit tests with Woodwork main branch workflow name (:pr:`2033`)
        * Add slack alert for failing unit tests with Woodwork main branch (:pr:`2040`)

    Thanks to the following people for contributing to this release:
    :user:`dvreed77`, :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`thehomebrewnerd`

Note
++++
* The update to the DFS algorithm in this release may cause the number of features returned
  by ``ft.dfs`` to increase in some cases.

v1.8.0 Mar 31, 2022
===================
    * Changes
        * Removed ``make_trans_primitive`` and ``make_agg_primitive`` utility functions (:pr:`1970`)
    * Documentation Changes
        * Update project urls in setup cfg to include Twitter and Slack (:pr:`1981`)
        * Update nbconvert to version 6.4.5 to fix docs build issue (:pr:`1984`)
        * Update ReadMe to have centered badges and add docs badge (:pr:`1993`)
        * Add M1 installation instructions to docs and contributing (:pr:`1997`)
    * Testing Changes
        * Updated scheduled workflows to only run on Alteryx owned repos (:pr:`1973`)
        * Updated minimum dependency checker to use new version with write file support (:pr:`1975`, :pr:`1976`)
        * Add black linting package and remove autopep8 (:pr:`1978`)
        * Update tests for compatibility with Woodwork version 0.15.0 (:pr:`1984`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`thehomebrewnerd`

Breaking Changes
++++++++++++++++
* The utility functions ``make_trans_primitive`` and ``make_agg_primitive`` have been removed. To create custom
  primitives, define the primitive class directly.
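
A minimal sketch of defining a transform primitive as a class is shown below. ``Doubled`` is a hypothetical example
primitive, and the sketch assumes Featuretools 1.x with Woodwork's ``ColumnSchema`` used to describe input and
return types:

.. code-block:: python

    import pandas as pd
    from featuretools.primitives import TransformPrimitive
    from woodwork.column_schema import ColumnSchema


    class Doubled(TransformPrimitive):
        """Multiplies a numeric column by two (illustrative only)."""

        name = "doubled"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def get_function(self):
            # Return the callable that is applied to the input column.
            def double(column):
                return column * 2

            return double


    # The returned function can be checked directly on a pandas Series.
    double = Doubled().get_function()
    print(double(pd.Series([1, 2, 3])).tolist())  # [2, 4, 6]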

v1.7.0 Mar 16, 2022
===================
    * Enhancements
        * Add support for Python 3.10 (:pr:`1940`)
        * Added the SquareRoot, NaturalLogarithm, Sine, Cosine and Tangent primitives (:pr:`1948`)
    * Fixes
        * Updated the conda install commands to specify the channel (:pr:`1917`)
    * Changes
        * Update error message when DFS returns an empty list of features (:pr:`1919`)
        * Remove ``list_variable_types`` and related directories (:pr:`1929`)
        * Transition to use pyproject.toml and setup.cfg (moving away from setup.py) (:pr:`1941`, :pr:`1950`, :pr:`1952`, :pr:`1954`, :pr:`1957`, :pr:`1964`)
        * Replace Koalas with pandas API on Spark (:pr:`1949`)
    * Documentation Changes
        * Add time series guide (:pr:`1896`)
        * Update minimum nlp_primitives requirement for docs (:pr:`1925`)
        * Add GitHub URL for PyPi (:pr:`1928`)
        * Add backport release support (:pr:`1932`)
        * Update instructions in ``release.md`` (:pr:`1963`)
    * Testing Changes
        * Update test cases to cover __main__.py file (:pr:`1927`)
        * Upgrade moto requirement (:pr:`1929`, :pr:`1938`)
        * Add Python 3.9 linting, install complete, and docs build CI tests (:pr:`1934`)
        * Add CI workflow to test with latest woodwork main branch (:pr:`1936`)
        * Add lower bound for wheel for minimum dependency checker and limit lint CI tests to Python 3.10 (:pr:`1945`)
        * Fix non-deterministic test in ``test_es.py`` (:pr:`1961`)

    Thanks to the following people for contributing to this release:
    :user:`andriyor`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kushal-gopal`, :user:`mingdavidqi`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tvdboom`

Breaking Changes
++++++++++++++++
* The deprecated utility ``list_variable_types`` has been removed from Featuretools.

v1.6.0 Feb 17, 2022
===================
    * Enhancements
        * Add ``IsFederalHoliday`` transform primitive (:pr:`1912`)
    * Fixes
        * Fix to catch new ``NotImplementedError`` raised by ``holidays`` library for unknown country (:pr:`1907`)
    * Changes
        * Remove outdated pandas workaround code (:pr:`1906`)
    * Documentation Changes
        * Add in-line tabs and copy-paste functionality to docs (:pr:`1905`)
    * Testing Changes
        * Fix URL deserialization file (:pr:`1909`)

    Thanks to the following people for contributing to this release:
    :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd`


v1.5.0 Feb 14, 2022
===================
    .. warning::
        Featuretools may not support Python 3.7 in the next non-bugfix release.

    * Enhancements
        * Add ability to use offset alias strings as inputs to rolling primitives (:pr:`1809`)
        * Update to add support for pandas version 1.4.0 (:pr:`1881`, :pr:`1895`)
    * Fixes
        * Fix ``featuretools_primitives`` entry point (:pr:`1891`)
    * Changes
        * Allow only snake, camel, and title case for primitives (:pr:`1854`)
        * Add autonormalize as an add-on library (:pr:`1840`)
        * Add DateToHoliday Transform Primitive (:pr:`1848`)
        * Add DistanceToHoliday Transform Primitive (:pr:`1853`)
        * Temporarily restrict pandas and koalas max versions (:pr:`1863`)
        * Add ``__setitem__`` method to overload ``add_dataframe`` method on EntitySet (:pr:`1862`)
        * Add support for woodwork 0.12.0 (:pr:`1872`, :pr:`1897`)
        * Split Datetime and LatLong primitives into separate files (:pr:`1861`)
        * Null values will not be included in index of normalized dataframe (:pr:`1897`)
    * Documentation Changes
        * Bump ipython version (:pr:`1857`)
        * Update README.md with Alteryx link (:pr:`1886`)
    * Testing Changes
        * Add check for package conflicts with install workflow (:pr:`1843`)
        * Change auto approve workflow to use assignee (:pr:`1843`)
        * Update auto approve workflow to delete branch and change on trigger (:pr:`1852`)
        * Upgrade tests to use compose version 0.8.0 (:pr:`1856`)
        * Updated deep feature synthesis and feature serialization tests to use new primitive files (:pr:`1861`)

    Thanks to the following people for contributing to this release:
    :user:`dvreed77`, :user:`gsheni`, :user:`jacobboney`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tuethan1999`

Breaking Changes
++++++++++++++++
* When using ``normalize_dataframe`` to create a new dataframe, the new dataframe's index will not include a null value.

v1.4.0 Jan 10, 2022
===================
    * Enhancements
        * Add LatLong transform primitives - GeoMidpoint, IsInGeoBox, CityblockDistance (:pr:`1814`)
        * Add issue templates for bugs, feature requests and documentation improvements (:pr:`1834`)
    * Fixes
        * Fix bug where Woodwork initialization could fail on feature matrix if cutoff times caused null values to be introduced (:pr:`1810`)
    * Changes
        * Skip code coverage for specific dask usage lines (:pr:`1829`)
        * Increase minimum required numpy version to 1.21.0, scipy to 1.3.3, koalas to 1.8.1 (:pr:`1833`)
        * Remove pyyaml as a requirement (:pr:`1833`)
    * Documentation Changes
        * Remove testing on conda forge in release.md (:pr:`1811`)
    * Testing Changes
        * Enable auto-merge for minimum and latest dependency merge requests (:pr:`1818`, :pr:`1821`, :pr:`1822`)
        * Change auto approve workflow to use PR number and run every 30 minutes (:pr:`1827`)
        * Add auto approve workflow to run when unit tests complete (:pr:`1837`)
        * Test deserializing from S3 with mocked S3 fixtures only (:pr:`1825`)
        * Remove fastparquet as a test requirement (:pr:`1833`)

    Thanks to the following people for contributing to this release:
    :user:`davesque`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`

v1.3.0 Dec 2, 2021
==================
    * Enhancements
        * Add ``NumericLag`` transform primitive (:pr:`1797`)
    * Changes
        * Update pip to 21.3.1 for test requirements (:pr:`1789`)
    * Documentation Changes
        * Add Docker install instructions and documentation on the install page. (:pr:`1785`)
        * Update install page on documentation with correct python version (:pr:`1784`)
        * Fix formatting in Improving Computational Performance guide (:pr:`1786`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`HenryRocha`, :user:`tamargrey`, :user:`thehomebrewnerd`

v1.2.0 Nov 15, 2021
===================
    * Enhancements
        * Add Rolling Transform primitives with integer parameters (:pr:`1770`)
    * Fixes
        * Handle new graphviz FORMATS import (:pr:`1770`)
    * Changes
        * Add new version of featuretools_tsfresh_primitives as an add-on library (:pr:`1772`)
        * Add ``load_weather`` as demo dataset for time series (:pr:`1777`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`tamargrey`

v1.1.0 Nov 2, 2021
==================
    * Fixes
        * Check ``base_of_exclude`` attribute on primitive instead of feature class (:pr:`1749`)
        * Pin upper bound for pyspark (:pr:`1748`)
        * Fix bug where ``get_unused_primitives`` only recognized lowercase primitive strings (:pr:`1733`)
        * Require newer versions of dask and distributed (:pr:`1762`)
        * Fix bug with pass-through columns of cutoff_time df when n_jobs > 1 (:pr:`1765`)
    * Changes
        * Add new version of nlp_primitives as an add-on library (:pr:`1743`)
        * Change name of date_of_birth (column name) to birthday in mock dataset (:pr:`1754`)
    * Documentation Changes
        * Upgrade Sphinx and fix docs configuration error (:pr:`1760`)
    * Testing Changes
        * Modify CI to run unit test with latest dependencies on python 3.9 (:pr:`1738`)
        * Added Python version standardizer to Jupyter notebook linting (:pr:`1741`)

    Thanks to the following people for contributing to this release:
    :user:`bchen1116`, :user:`gsheni`, :user:`HenryRocha`, :user:`jeff-hernandez`, :user:`ridicolos`, :user:`rwedge`

v1.0.0 Oct 12, 2021
===================
    * Enhancements
        * Add support for creating EntitySets from Woodwork DataTables (:pr:`1277`)
        * Add ``EntitySet.__deepcopy__`` that retains Woodwork typing information (:pr:`1465`)
        * Add ``EntitySet.__getstate__`` and ``EntitySet.__setstate__`` to preserve typing when pickling (:pr:`1581`)
        * Returned feature matrix has woodwork typing information (:pr:`1664`)
    * Fixes
        * Fix ``DFSTransformer`` Documentation for Featuretools 1.0 (:pr:`1605`)
        * Fix ``calculate_feature_matrix`` time type check and ``encode_features`` for synthesis tests (:pr:`1580`)
        * Revert reordering of categories in ``Equal`` and ``NotEqual`` primitives (:pr:`1640`)
        * Fix bug in ``EntitySet.add_relationship`` that caused ``foreign_key`` tag to be lost (:pr:`1675`)
        * Update DFS to not build features on last time index columns in dataframes (:pr:`1695`)
    * Changes
        * Remove ``add_interesting_values`` from ``Entity`` (:pr:`1269`)
        * Move ``set_secondary_time_index`` method from ``Entity`` to ``EntitySet`` (:pr:`1280`)
        * Refactor Relationship creation process (:pr:`1370`)
        * Replaced ``Entity.update_data`` with ``EntitySet.update_dataframe`` (:pr:`1398`)
        * Move validation check for uniform time index to ``EntitySet`` (:pr:`1400`)
        * Replace ``Entity`` objects in ``EntitySet`` with Woodwork dataframes (:pr:`1405`)
        * Refactor ``EntitySet.plot`` to work with Woodwork dataframes (:pr:`1468`)
        * Move ``last_time_index`` to be a column on the DataFrame (:pr:`1456`)
        * Update serialization/deserialization to work with Woodwork (:pr:`1452`)
        * Refactor ``EntitySet.query_by_values`` to work with Woodwork dataframes (:pr:`1467`)
        * Replace ``list_variable_types`` with ``list_logical_types`` (:pr:`1477`)
        * Allow deep EntitySet equality check (:pr:`1480`)
        * Update ``EntitySet.concat`` to work with Woodwork DataFrames (:pr:`1490`)
        * Add function to list semantic tags (:pr:`1486`)
        * Initialize Woodwork on feature matrix in ``remove_highly_correlated_features`` if necessary (:pr:`1618`)
        * Remove categorical-encoding as an add-on library (will be added back later) (:pr:`1632`)
        * Remove autonormalize as an add-on library (will be added back later) (:pr:`1636`)
        * Remove tsfresh, nlp_primitives, sklearn_transformer as an add-on library (will be added back later) (:pr:`1638`)
        * Update input and return types for ``CumCount`` primitive (:pr:`1651`)
        * Standardize imports of Woodwork (:pr:`1526`)
        * Rename target entity to target dataframe (:pr:`1506`)
        * Replace ``entity_from_dataframe`` with ``add_dataframe`` (:pr:`1504`)
        * Create features from Woodwork columns (:pr:`1582`)
        * Move default variable description logic to ``generate_description`` (:pr:`1403`)
        * Update Woodwork to version 0.4.0 with ``LogicalType.transform`` and LogicalType instances (:pr:`1451`)
        * Update Woodwork to version 0.4.1 with Ordinal order values and whitespace serialization fix (:pr:`1478`)
        * Use ``ColumnSchema`` for primitive input and return types (:pr:`1411`)
        * Update features to use Woodwork and remove ``Entity`` and ``Variable`` classes (:pr:`1501`)
        * Re-add ``make_index`` functionality to EntitySet (:pr:`1507`)
        * Use ``ColumnSchema`` in DFS primitive matching (:pr:`1523`)
        * Updates from Featuretools v0.26.0 (:pr:`1539`)
        * Leverage Woodwork better in ``add_interesting_values`` (:pr:`1550`)
        * Update ``calculate_feature_matrix`` to use Woodwork (:pr:`1533`)
        * Update Woodwork to version 0.6.0 with changed categorical inference (:pr:`1597`)
        * Update ``nlp-primitives`` requirement for Featuretools 1.0 (:pr:`1609`)
        * Remove remaining references to ``Entity`` and ``Variable`` in code (:pr:`1612`)
        * Update Woodwork to version 0.7.1 with changed initialization (:pr:`1648`)
        * Removes outdated workaround code related to a since-resolved pandas issue (:pr:`1677`)
        * Remove unused ``_dataframes_equal`` and ``camel_to_snake`` functions (:pr:`1683`)
        * Update Woodwork to version 0.8.0 for improved performance (:pr:`1689`)
        * Remove redundant typecasting in ``encode_features`` (:pr:`1694`)
        * Speed up ``encode_features`` if not inplace, some space cost (:pr:`1699`)
        * Clean up comments and commented out code (:pr:`1701`)
        * Update Woodwork to version 0.8.1 for improved performance (:pr:`1702`)
    * Documentation Changes
        * Add a Woodwork Typing in Featuretools guide (:pr:`1589`)
        * Add a resource guide for transitioning to Featuretools 1.0 (:pr:`1627`)
        * Update ``using_entitysets`` page to use Woodwork (:pr:`1532`)
        * Update FAQ page to use Woodwork integration (:pr:`1649`)
        * Update DFS page to be Jupyter notebook and use Woodwork integration (:pr:`1557`)
        * Update Feature Primitives page to be Jupyter notebook and use Woodwork integration (:pr:`1556`)
        * Update Handling Time page to be Jupyter notebook and use Woodwork integration (:pr:`1552`)
        * Update Advanced Custom Primitives page to be Jupyter notebook and use Woodwork integration (:pr:`1587`)
        * Update Deployment page to use Woodwork integration (:pr:`1588`)
        * Update Using Dask EntitySets page to be Jupyter notebook and use Woodwork integration (:pr:`1590`)
        * Update Specifying Primitive Options page to be Jupyter notebook and use Woodwork integration (:pr:`1593`)
        * Update API Reference to match Featuretools 1.0 API (:pr:`1600`)
        * Update Index page to be Jupyter notebook and use Woodwork integration (:pr:`1602`)
        * Update Feature Descriptions page to be Jupyter notebook and use Woodwork integration (:pr:`1603`)
        * Update Using Koalas EntitySets page to be Jupyter notebook and use Woodwork integration (:pr:`1604`)
        * Update Glossary to use Woodwork integration (:pr:`1608`)
        * Update Tuning DFS page to be Jupyter notebook and use Woodwork integration (:pr:`1610`)
        * Fix small formatting issues in Documentation (:pr:`1607`)
        * Remove Variables page and more references to variables (:pr:`1629`)
        * Update Feature Selection page to use Woodwork integration (:pr:`1618`)
        * Update Improving Performance page to be Jupyter notebook and use Woodwork integration (:pr:`1591`)
        * Fix typos in transition guide (:pr:`1672`)
        * Update installation instructions for 1.0.0rc1 announcement in docs (:pr:`1707`, :pr:`1708`, :pr:`1713`, :pr:`1716`)
        * Fixed broken link for Demo notebook in README.md (:pr:`1728`)
        * Update ``contributing.md`` to improve instructions for external contributors (:pr:`1723`)
        * Manually revert changes made by :pr:`1677` and :pr:`1679`.  The related bug in pandas still exists. (:pr:`1731`)
    * Testing Changes
        * Remove entity tests (:pr:`1521`)
        * Fix broken ``EntitySet`` tests (:pr:`1548`)
        * Fix broken primitive tests (:pr:`1568`)
        * Added Jupyter notebook cleaner to the linters (:pr:`1719`)
        * Update reviewers for minimum and latest dependency checkers (:pr:`1715`)
        * Full coverage for EntitySet.__eq__ method (:pr:`1725`)
        * Add tests to verify all primitives can be initialized without parameter values (:pr:`1726`)

    Thanks to the following people for contributing to this release:
    :user:`bchen1116`, :user:`gsheni`, :user:`HenryRocha`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`VaishnaviNandakumar`

Breaking Changes
++++++++++++++++

* ``Entity.add_interesting_values`` has been removed. To add interesting values for a single
  entity, call ``EntitySet.add_interesting_values`` and pass the name of the dataframe for
  which to add interesting values in the ``dataframe_name`` parameter (:pr:`1405`, :pr:`1370`).
* ``Entity.set_secondary_time_index`` has been removed and replaced by ``EntitySet.set_secondary_time_index``
  with an added ``dataframe_name`` parameter to specify the dataframe on which to set the secondary time index (:pr:`1405`, :pr:`1370`).
* ``Relationship`` initialization has been updated to accept four name values for the parent dataframe,
  parent column, child dataframe and child column instead of accepting two ``Variable`` objects  (:pr:`1405`, :pr:`1370`).
* ``EntitySet.add_relationship`` has been updated to accept dataframe and column name values or a
  ``Relationship`` object. Adding a relationship from a ``Relationship`` object now requires passing
  the relationship as a keyword argument  (:pr:`1405`, :pr:`1370`).
* ``Entity.update_data`` has been removed. To update the dataframe, call ``EntitySet.replace_dataframe`` and use the ``dataframe_name`` parameter (:pr:`1630`, :pr:`1522`).
* The data in an ``EntitySet`` is no longer stored in ``Entity`` objects. Instead, dataframes
  with Woodwork typing information are used. Accordingly, most language referring to “entities”
  will now refer to “dataframes”, references to “variables” will now refer to “columns”, and
  “variable types” will use the Woodwork type system’s “logical types” and “semantic tags” (:pr:`1405`).
* The dictionary of tuples passed to ``EntitySet.__init__`` has replaced the ``variable_types`` element
  with separate ``logical_types`` and ``semantic_tags`` dictionaries (:pr:`1405`).
* ``EntitySet.entity_from_dataframe`` no longer exists. To add new tables to an entityset, use ``EntitySet.add_dataframe`` (:pr:`1405`).
* ``EntitySet.normalize_entity`` has been renamed to ``EntitySet.normalize_dataframe`` (:pr:`1405`).
* Instead of raising an error at ``EntitySet.add_relationship`` when the dtypes of parent and child columns
  do not match, Featuretools will now check whether the Woodwork logical type of the parent and child columns
  match. If they do not match, there will now be a warning raised, and Featuretools will attempt to update
  the logical type of the child column to match the parent’s (:pr:`1405`).
* If no index is specified at ``EntitySet.add_dataframe``, the first column will only be used as index if
  Woodwork has not been initialized on the DataFrame. When adding a dataframe that already has Woodwork
  initialized, if there is no index set, an error will be raised (:pr:`1405`).
* Featuretools will no longer re-order columns in DataFrames so that the index column is the first column of the DataFrame (:pr:`1405`).
* Type inference can now be performed on Dask and Koalas dataframes, though a warning will be issued
  indicating that this may be computationally intensive (:pr:`1405`).
* ``EntitySet.time_type`` is no longer stored as ``Variable`` objects. Instead, Woodwork typing is used, and a
  numeric time type will be indicated by the ``'numeric'`` semantic tag string, and a datetime time type
  will be indicated by the ``Datetime`` logical type (:pr:`1405`).
* ``last_time_index``, ``secondary_time_index``, and ``interesting_values`` are no longer attributes
  of an entityset’s tables that can be accessed directly. Now they must be accessed through the metadata
  of the Woodwork DataFrame, which is a dictionary (:pr:`1405`).
* The helper function ``list_variable_types`` will be removed in a future release and replaced by ``list_logical_types``.
  In the meantime, ``list_variable_types`` will return the same output as ``list_logical_types`` (:pr:`1447`).
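
Several of the changes above can be seen together in a minimal sketch. It assumes Featuretools 1.0 or later with
pandas input; the data, the column names, and the ``currency`` tag are hypothetical and chosen only for illustration.

.. code-block:: python

    import pandas as pd
    import featuretools as ft

    # Hypothetical example data; the column names are illustrative only.
    transactions = pd.DataFrame({
        "transaction_id": [1, 2, 3, 4],
        "session_id": [10, 10, 20, 20],
        "amount": [5.0, 12.5, 7.25, 3.0],
        "transaction_time": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
        ),
    })

    es = ft.EntitySet(id="example")

    # ``add_dataframe`` takes the place of ``entity_from_dataframe``. Typing is
    # given as Woodwork logical types and semantic tags, not variable types.
    es.add_dataframe(
        dataframe_name="transactions",
        dataframe=transactions,
        index="transaction_id",
        time_index="transaction_time",
        logical_types={"amount": "Double"},
        semantic_tags={"amount": "currency"},  # hypothetical custom tag
    )

    # ``normalize_dataframe`` takes the place of ``normalize_entity``.
    es.normalize_dataframe(
        base_dataframe_name="transactions",
        new_dataframe_name="sessions",
        index="session_id",
    )

    # Typing information and metadata are read through the Woodwork accessor.
    print(es["transactions"].ww.logical_types["amount"])
    print(es["sessions"].ww.metadata)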

What's New in this Release
++++++++++++++++++++++++++

**Adding Interesting Values**

To add interesting values for a single dataframe, call ``EntitySet.add_interesting_values``, passing the
name of the dataframe for which interesting values should be added.

.. code-block:: python

    >>> es.add_interesting_values(dataframe_name='log')

**Setting a Secondary Time Index**

To set a secondary time index for a specific dataframe, call ``EntitySet.set_secondary_time_index``, passing
the name of the dataframe along with a dictionary that maps the secondary time index column to the
columns to which the secondary time index applies.

.. code-block:: python

    >>> customers_secondary_time_index = {'cancel_date': ['cancel_reason']}
    >>> es.set_secondary_time_index(dataframe_name='customers', secondary_time_index=customers_secondary_time_index)

**Creating a Relationship and Adding to an EntitySet**

Relationships are now created by passing parameters identifying the entityset along with four string values
specifying the parent dataframe, parent column, child dataframe and child column. Specifying parameter names
is optional.

.. code-block:: python

    >>> new_relationship = Relationship(
    ...     entityset=es,
    ...     parent_dataframe_name='customers',
    ...     parent_column_name='id',
    ...     child_dataframe_name='sessions',
    ...     child_column_name='customer_id'
    ... )

Relationships can now be added to EntitySets in one of two ways. The first approach is to pass in
name values for the parent dataframe, parent column, child dataframe and child column. Specifying
parameter names is optional with this approach.

.. code-block:: python

    >>> es.add_relationship(
    ...     parent_dataframe_name='customers',
    ...     parent_column_name='id',
    ...     child_dataframe_name='sessions',
    ...     child_column_name='customer_id'
    ... )

Relationships can also be added by passing in a previously created ``Relationship`` object. When using
this approach the ``relationship`` parameter name must be included.

.. code-block:: python

    >>> es.add_relationship(relationship=new_relationship)
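
The same four name values can also be supplied when an ``EntitySet`` is constructed. The sketch below assumes
Featuretools 1.0 or later and that each entry of the ``dataframes`` dictionary is a tuple ordered as dataframe,
index, time index, logical types, semantic tags. The data and column names are hypothetical.

.. code-block:: python

    import pandas as pd
    import featuretools as ft

    # Hypothetical example data.
    customers = pd.DataFrame({"id": [1, 2], "zip_code": ["60091", "02139"]})
    sessions = pd.DataFrame({"session_id": [10, 11, 12], "customer_id": [1, 1, 2]})

    # Trailing tuple elements may be omitted when they are not needed.
    dataframes = {
        "customers": (customers, "id", None, {"zip_code": "Categorical"}),
        "sessions": (sessions, "session_id"),
    }

    # Each relationship is (parent dataframe, parent column, child dataframe, child column).
    relationships = [("customers", "id", "sessions", "customer_id")]

    es = ft.EntitySet(id="example", dataframes=dataframes, relationships=relationships)

    # The child column carries the ``foreign_key`` semantic tag once the relationship exists.
    print(es["sessions"].ww.semantic_tags["customer_id"])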

**Replace DataFrame**

To replace a dataframe in an EntitySet with a new dataframe, call ``EntitySet.replace_dataframe`` and pass in the name of the dataframe to replace along with the new data.

.. code-block:: python

    >>> es.replace_dataframe(dataframe_name='log', df=df)

**List Logical Types and Semantic Tags**

Logical types and semantic tags have replaced variable types to parse and interpret columns. You can list all the available logical types by calling ``featuretools.list_logical_types``.

.. code-block:: python

    >>> ft.list_logical_types()

You can list all the available semantic tags by calling ``featuretools.list_semantic_tags``.

.. code-block:: python

    >>> ft.list_semantic_tags()
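
Both helpers return a table that can be inspected like any other pandas DataFrame. The following sketch assumes
Featuretools 1.0 or later. It prints the column names instead of relying on specific ones, since the exact columns
may differ between Woodwork versions.

.. code-block:: python

    import featuretools as ft

    logical_types = ft.list_logical_types()
    semantic_tags = ft.list_semantic_tags()

    # Check which columns are available before filtering on any of them.
    print(logical_types.columns.tolist())
    print(semantic_tags.head())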

v0.27.1 Sep 2, 2021
===================
    * Documentation Changes
        * Add banner to docs about upcoming Featuretools 1.0 release (:pr:`1669`)

    Thanks to the following people for contributing to this release:
    :user:`thehomebrewnerd`

v0.27.0 Aug 31, 2021
====================
    * Changes
        * Remove autonormalize, tsfresh, nlp_primitives, sklearn_transformer, categorical_encoding as add-on libraries (will be added back later) (:pr:`1644`)
        * Emit a warning message when a ``featuretools_primitives`` entrypoint
          throws an exception (:pr:`1662`)
        * Throw a ``RuntimeError`` when two primitives with the same name are
          encountered during ``featuretools_primitives`` entrypoint handling
          (:pr:`1662`)
        * Prevent the ``featuretools_primitives`` entrypoint loader from
          loading non-class objects as well as the ``AggregationPrimitive`` and
          ``TransformPrimitive`` base classes (:pr:`1662`)
    * Testing Changes
        * Update latest dependency checker with proper install command (:pr:`1652`)
        * Update isort dependency (:pr:`1654`)

    Thanks to the following people for contributing to this release:
    :user:`davesque`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`

v0.26.2 Aug 17, 2021
====================
    * Documentation Changes
        * Specify conda channel and Windows exe in graphviz installation instructions (:pr:`1611`)
        * Remove GA token from the layout html (:pr:`1622`)
    * Testing Changes
        * Add additional reviewers to minimum and latest dependency checkers (:pr:`1558`, :pr:`1562`, :pr:`1564`, :pr:`1567`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`simha104`

v0.26.1 Jul 23, 2021
====================
    * Fixes
        * Set ``name`` attribute for ``EmailAddressToDomain`` primitive (:pr:`1543`)
    * Documentation Changes
        * Remove and ignore unnecessary graph files (:pr:`1544`)

    Thanks to the following people for contributing to this release:
    :user:`davesque`, :user:`rwedge`

v0.26.0 Jul 15, 2021
====================
    * Enhancements
        * Add ``replace_inf_values`` utility function for replacing ``inf`` values in a feature matrix (:pr:`1505`)
        * Add URLToProtocol, URLToDomain, URLToTLD, EmailAddressToDomain, IsFreeEmailDomain as transform primitives (:pr:`1508`, :pr:`1531`)
    * Fixes
        * ``include_entities`` correctly overrides ``exclude_entities`` in ``primitive_options`` (:pr:`1518`)
    * Documentation Changes
        * Prevent logging on build (:pr:`1498`)
    * Testing Changes
        * Test featuretools on pandas 1.3.0 release candidate and make fixes (:pr:`1492`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tuethan1999`

v0.25.0 Jun 11, 2021
====================
    * Enhancements
        * Add ``get_valid_primitives`` function (:pr:`1462`)
        * Add ``EntitySet.dataframe_type`` attribute (:pr:`1473`)
    * Changes
        * Upgrade minimum alteryx open source update checker to 2.0.0 (:pr:`1460`)
    * Testing Changes
        * Upgrade minimum pip requirement for testing to 21.1.2 (:pr:`1475`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`

v0.24.1 May 26, 2021
====================
    * Fixes
        * Update minimum pyyaml requirement to 5.4 (:pr:`1433`)
        * Update minimum psutil requirement to 5.6.6 (:pr:`1438`)
    * Documentation Changes
        * Update nbsphinx version to fix docs build issue (:pr:`1436`)
    * Testing Changes
        * Create separate workflows for each CI job (:pr:`1422`)
        * Add minimum dependency checker to generate minimum requirement files (:pr:`1428`)
        * Add unit tests against minimum dependencies for python 3.7 on PRs and main (:pr:`1432`, :pr:`1445`)
        * Update minimum urllib3 requirement to 1.26.5 (:pr:`1457`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd`

v0.24.0 Apr 30, 2021
====================
    * Changes
        * Add auto assign bot on GitHub (:pr:`1380`)
        * Reduce DFS max_depth to 1 if single entity in entityset (:pr:`1412`)
        * Drop Python 3.6 support (:pr:`1413`)
    * Documentation Changes
        * Improve formatting of release notes (:pr:`1396`)
    * Testing Changes
        * Update Dask/Koalas test fixtures (:pr:`1382`)
        * Update Spark config in test fixtures and docs (:pr:`1387`, :pr:`1389`)
        * Don't cancel other CI jobs if one fails (:pr:`1386`)
        * Update boto3 and urllib3 version requirements (:pr:`1394`)
        * Update token for dependency checker PR creation (:pr:`1402`, :pr:`1407`, :pr:`1409`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`

v0.23.3 Mar 31, 2021
====================
    .. warning::
        The next non-bugfix release of Featuretools will not support Python 3.6

    * Changes
        * Minor updates to work with Koalas version 1.7.0 (:pr:`1351`)
        * Explicitly mention Python 3.8 support in setup.py classifiers (:pr:`1371`)
        * Fix issue with smart-open version 5.0.0 (:pr:`1372`, :pr:`1376`)
    * Testing Changes
        * Make release notes updated check separate from unit tests (:pr:`1347`)
        * Performance tests now specify which commit to check (:pr:`1354`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`

v0.23.2 Feb 26, 2021
====================
    .. warning::
        The next non-bugfix release of Featuretools will not support Python 3.6

    * Enhancements
        * The ``list_primitives`` function returns valid input types and the return type (:pr:`1341`)
    * Fixes
        * Restrict numpy version when installing koalas (:pr:`1329`)
    * Changes
        * Warn python 3.6 users support will be dropped in future release (:pr:`1344`)
    * Documentation Changes
        * Update docs for defining custom primitives (:pr:`1332`)
        * Update featuretools release instructions (:pr:`1345`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`

v0.23.1 Jan 29, 2021
====================
    * Fixes
        * Calculate direct features uses default value if parent missing (:pr:`1312`)
        * Fix bug and improve tests for ``EntitySet.__eq__`` and ``Entity.__eq__`` (:pr:`1323`)
    * Documentation Changes
        * Update Twitter link to documentation toolbar (:pr:`1322`)
    * Testing Changes
        * Unpin python-graphviz package on Windows (:pr:`1296`)
        * Reorganize and clean up tests (:pr:`1294`, :pr:`1303`, :pr:`1306`)
        * Trigger tests on pull request events (:pr:`1304`, :pr:`1315`)
        * Remove unnecessary test skips on Windows (:pr:`1320`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`seriallazer`, :user:`thehomebrewnerd`

v0.23.0 Dec 31, 2020
====================
    * Fixes
        * Fix logic for inferring variable type from unusual dtype (:pr:`1273`)
        * Allow passing entities without relationships to ``calculate_feature_matrix`` (:pr:`1290`)
    * Changes
        * Move ``query_by_values`` method from ``Entity`` to ``EntitySet`` (:pr:`1251`)
        * Move ``_handle_time`` method from ``Entity`` to ``EntitySet`` (:pr:`1276`)
        * Remove usage of ``ravel`` to resolve unexpected warning with pandas 1.2.0 (:pr:`1286`)
    * Documentation Changes
        * Fix installation command for Add-ons (:pr:`1279`)
        * Fix various broken links in documentation (:pr:`1313`)
    * Testing Changes
        * Use repository-scoped token for dependency check (:pr:`1245`, :pr:`1248`)
        * Fix install error during docs CI test (:pr:`1250`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd`

Breaking Changes
++++++++++++++++

* ``Entity.query_by_values`` has been removed and replaced by ``EntitySet.query_by_values`` with an
  added ``entity_id`` parameter to specify which entity in the entityset should be used for the query.

v0.22.0 Nov 30, 2020
====================
    * Enhancements
        * Allow variable descriptions to be set directly on variable (:pr:`1207`)
        * Add ability to add feature description captions to feature lineage graphs (:pr:`1212`)
        * Add support for local tar file in read_entityset (:pr:`1228`)
    * Fixes
        * Updates to fix unit test errors from koalas 1.4 (:pr:`1230`, :pr:`1232`)
    * Documentation Changes
        * Removed link to unused feedback board (:pr:`1220`)
        * Update footer with Alteryx Innovation Labs (:pr:`1221`)
        * Update links to repo in documentation to use alteryx org url (:pr:`1224`)
    * Testing Changes
        * Update release notes check to use new repo url (:pr:`1222`)
        * Use new version of pull request Github Action (:pr:`1234`)
        * Upgrade pip during featuretools[complete] test (:pr:`1236`)
        * Migrated CI tests to github actions (:pr:`1226`, :pr:`1237`, :pr:`1239`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kmax12`, :user:`rwedge`, :user:`thehomebrewnerd`

v0.21.0 Oct 30, 2020
====================
    * Enhancements
        * Add ``describe_feature`` to generate an English language feature description for a given feature (:pr:`1201`)
    * Fixes
        * Update ``EntitySet.add_last_time_indexes`` to work with Koalas 1.3.0 (:pr:`1192`, :pr:`1202`)
    * Changes
        * Keep koalas requirements in separate file (:pr:`1195`)
    * Documentation Changes
        * Added footer to the documentation (:pr:`1189`)
        * Add guide for feature selection functions (:pr:`1184`)
        * Fix README.md badge with correct link (:pr:`1200`)
    * Testing Changes
        * Add ``pyspark`` and ``koalas`` to automated dependency checks (:pr:`1191`)
        * Add DockerHub credentials to CI testing environment (:pr:`1204`)
        * Update premium primitives job name on CI (:pr:`1205`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`

v0.20.0 Sep 30, 2020
====================
    .. warning::
        The Text variable type has been deprecated and been replaced with the NaturalLanguage variable type. The Text variable type will be removed in a future release.

    * Fixes
        * Allow FeatureOutputSlice features to be serialized (:pr:`1150`)
        * Fix duplicate label column generation when labels are passed in cutoff times and approximate is being used (:pr:`1160`)
        * Determine calculate_feature_matrix behavior with approximate and a cutoff df that is a subclass of a pandas DataFrame (:pr:`1166`)
    * Changes
        * Text variable type has been replaced with NaturalLanguage (:pr:`1159`)
    * Documentation Changes
        * Update release doc for clarity and to add Future Release template (:pr:`1151`)
        * Use the PyData Sphinx theme (:pr:`1169`)
    * Testing Changes
        * Stop requiring single-threaded dask scheduler in tests (:pr:`1163`, :pr:`1170`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`, :user:`tamargrey`, :user:`tuethan1999`

v0.19.0 Sep 8, 2020
===================
    * Enhancements
        * Support use of Koalas DataFrames in entitysets (:pr:`1031`)
        * Add feature selection functions for null, correlated, and single value features (:pr:`1126`)
    * Fixes
        * Fix ``encode_features`` converting excluded feature columns to a numeric dtype (:pr:`1123`)
        * Improve performance of unused primitive check in dfs (:pr:`1140`)
    * Changes
        * Remove the ability to stack transform primitives (:pr:`1119`, :pr:`1145`)
        * Sort primitives passed to ``dfs`` to get consistent ordering of features\* (:pr:`1119`)
    * Documentation Changes
        * Added return values to dfs and calculate_feature_matrix (:pr:`1125`)
    * Testing Changes
        * Better test case for normalizing from no time index to time index (:pr:`1113`)

    \* When passing multiple instances of a primitive built with ``make_trans_primitive``
    or ``make_agg_primitive``, those instances must have the same relative order when passed
    to ``dfs`` to ensure a consistent ordering of features.

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tuethan1999`


Breaking Changes
++++++++++++++++

* ``ft.dfs`` will no longer build features from Transform primitives where one
  of the inputs is a Transform feature, a GroupByTransform feature,
  or a Direct Feature of a Transform / GroupByTransform feature. This will make some
  features that would previously be generated by ``ft.dfs`` only possible if
  explicitly specified in ``seed_features``.
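
A stacked transform feature can still be requested by building it by hand and passing it as a seed feature. The
sketch below is written against the Featuretools 1.0+ API (``add_dataframe``, ``target_dataframe_name``, and the
``ww`` accessor), which differs from the entity-based API of this release. The data and names are hypothetical.

.. code-block:: python

    import pandas as pd
    import featuretools as ft
    from featuretools.primitives import Absolute, Negate

    # Hypothetical example data.
    transactions = pd.DataFrame(
        {"transaction_id": [1, 2, 3], "amount": [-5.0, 12.5, -7.25]}
    )
    es = ft.EntitySet(id="example")
    es.add_dataframe(
        dataframe_name="transactions",
        dataframe=transactions,
        index="transaction_id",
    )

    # Build the stacked feature explicitly: a transform applied to a transform.
    amount = ft.Feature(es["transactions"].ww["amount"])
    stacked = ft.Feature(ft.Feature(amount, primitive=Absolute), primitive=Negate)

    # Passing the stacked feature as a seed feature includes it in the output.
    features = ft.dfs(
        entityset=es,
        target_dataframe_name="transactions",
        agg_primitives=[],
        trans_primitives=["absolute"],
        seed_features=[stacked],
        features_only=True,
    )
    print([f.get_name() for f in features])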

v0.18.1 Aug 12, 2020
====================
    * Fixes
        * Fix ``EntitySet.plot()`` when given a dask entityset (:pr:`1086`)
    * Changes
        * Use ``nlp-primitives[complete]`` install for ``nlp_primitives`` extra in ``setup.py`` (:pr:`1103`)
    * Documentation Changes
        * Fix broken downloads badge in README.md (:pr:`1107`)
    * Testing Changes
        * Use CircleCI matrix jobs in config to trigger multiple runs of same job with different parameters (:pr:`1105`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`systemshift`, :user:`thehomebrewnerd`

v0.18.0 Jul 31, 2020
====================
    * Enhancements
        * Warn user if supplied primitives are not used during dfs (:pr:`1073`)
    * Fixes
        * Use more consistent and uniform warnings (:pr:`1040`)
        * Fix issue with missing instance ids and categorical entity index (:pr:`1050`)
        * Remove warnings.simplefilter in feature_set_calculator to un-silence warnings (:pr:`1053`)
        * Fix feature visualization for features with '>' or '<' in name (:pr:`1055`)
        * Fix boolean dtype mismatch between encode_features and dfs and calculate_feature_matrix (:pr:`1082`)
        * Update primitive options to check reversed inputs if primitive is commutative (:pr:`1085`)
        * Fix inconsistent ordering of features between kernel restarts (:pr:`1088`)
    * Changes
        * Make DFS match ``TimeSince`` primitive with all ``Datetime`` types (:pr:`1048`)
        * Change default branch to ``main`` (:pr:`1038`)
        * Raise TypeError if improper input is supplied to ``Entity.delete_variables()`` (:pr:`1064`)
        * Updates for compatibility with pandas 1.1.0 (:pr:`1079`, :pr:`1089`)
        * Set pandas version to pandas>=0.24.1,<2.0.0. Filter pandas deprecation warning in Week primitive. (:pr:`1094`)
    * Documentation Changes
        * Remove benchmarks folder (:pr:`1049`)
        * Add custom variables types section to variables page (:pr:`1066`)
    * Testing Changes
        * Add fixture for ``ft.demo.load_mock_customer`` (:pr:`1036`)
        * Refactor Dask test units (:pr:`1052`)
        * Implement automated process for checking critical dependencies (:pr:`1045`, :pr:`1054`, :pr:`1081`)
        * Don't run changelog check for release PRs or automated dependency PRs (:pr:`1057`)
        * Fix non-deterministic behavior in Dask test causing codecov issues (:pr:`1070`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`gsheni`, :user:`monti-python`, :user:`rwedge`,
    :user:`systemshift`,  :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`wsankey`

v0.17.0 Jun 30, 2020
====================
    * Enhancements
        * Add ``list_variable_types`` and ``graph_variable_types`` for Variable Types (:pr:`1013`)
        * Add ``graph_feature`` to generate a feature lineage graph for a given feature (:pr:`1032`)
    * Fixes
        * Improve warnings when using a Dask dataframe for cutoff times (:pr:`1026`)
        * Error if attempting to add entityset relationship where child variable is also child index (:pr:`1034`)
    * Changes
        * Remove ``Feature.get_names`` (:pr:`1021`)
        * Remove unnecessary ``pd.Series`` and ``pd.DatetimeIndex`` calls from primitives (:pr:`1020`, :pr:`1024`)
        * Improve cutoff time handling when a single value or no value is passed (:pr:`1028`)
        * Moved ``find_variable_types`` to Variable utils (:pr:`1013`)
    * Documentation Changes
        * Add page on Variable Types to describe some Variable Types, and util functions (:pr:`1013`)
        * Remove featuretools enterprise from documentation (:pr:`1022`)
        * Add development install instructions to contributing.md (:pr:`1030`)
    * Testing Changes
        * Add ``required`` flag to CircleCI codecov upload command (:pr:`1035`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`gsheni`, :user:`kmax12`, :user:`rwedge`,
    :user:`thehomebrewnerd`, :user:`tuethan1999`

Breaking Changes
++++++++++++++++

* ``Feature.get_names`` has been removed. Use ``Feature.get_feature_names`` instead.
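
As a brief illustration, ``get_feature_names`` returns a list with one name per output column of the feature. The
setup below uses the Featuretools 1.0+ ``EntitySet`` API for convenience, which is an assumption and not the API of
this release; only the ``get_feature_names`` call is the point of the sketch. The data is hypothetical.

.. code-block:: python

    import pandas as pd
    import featuretools as ft

    # Hypothetical example data.
    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})
    es = ft.EntitySet(id="example")
    es.add_dataframe(dataframe_name="log", dataframe=df, index="id")

    # An identity feature has a single output, so the list holds one name.
    feature = ft.Feature(es["log"].ww["value"])
    names = feature.get_feature_names()
    print(names)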

v0.16.0 Jun 5, 2020
===================
    * Enhancements
        * Support use of Dask DataFrames in entitysets (:pr:`783`)
        * Add ``make_index`` when initializing an EntitySet by passing in an ``entities`` dictionary (:pr:`1010`)
        * Add ability to use primitive classes and instances as keys in primitive_options dictionary (:pr:`993`)
    * Fixes
        * Cleanly close tqdm instance (:pr:`1018`)
        * Resolve issue with ``NaN`` values in ``LatLong`` columns (:pr:`1007`)
    * Testing Changes
        * Update tests for numpy v1.19.0 compatibility (:pr:`1016`)

    Thanks to the following people for contributing to this release:
    :user:`Alex-Monahan`, :user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`

v0.15.0 May 29, 2020
====================
    * Enhancements
        * Add ``get_default_aggregation_primitives`` and ``get_default_transform_primitives`` (:pr:`945`)
        * Allow cutoff time dataframe columns to be in any order (:pr:`969`, :pr:`995`)
        * Add Age primitive, and make it a default transform primitive for DFS (:pr:`987`)
        * Add ``include_cutoff_time`` arg - control whether data at cutoff times are included in feature calculations (:pr:`959`)
        * Allow ``variable_types`` to be referenced by their ``type_string``
          for the ``entity_from_dataframe`` function (:pr:`988`)
    * Fixes
        * Fix errors with Equals and NotEquals primitives when comparing categoricals or different dtypes (:pr:`968`)
        * Normalized type_strings of ``Variable`` classes so that the ``find_variable_types`` function produces a
          dictionary with a clear key to name transition (:pr:`982`, :pr:`996`)
        * Remove pandas.datetime in test_calculate_feature_matrix due to deprecation (:pr:`998`)
    * Documentation Changes
        * Add python 3.8 support for docs (:pr:`983`)
        * Adds consistent Entityset Docstrings (:pr:`986`)
    * Testing Changes
        * Add automated tests for python 3.8 environment (:pr:`847`)
        * Update testing dependencies (:pr:`976`)

    Thanks to the following people for contributing to this release:
    :user:`ctduffy`, :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rightx2`, :user:`rwedge`, :user:`sebrahimi1988`, :user:`thehomebrewnerd`, :user:`tuethan1999`

Breaking Changes
++++++++++++++++

* Calls to ``featuretools.dfs`` or ``featuretools.calculate_feature_matrix`` that use a cutoff time
  dataframe, but do not label the time column with either the target entity time index variable name or
  as ``time``, will now result in an ``AttributeError``. Previously, the time column was selected to be the first
  column that was not the instance id column. With this update, the position of the column in the dataframe is
  no longer used to determine the time column. Now, both instance id columns and time columns in a cutoff time
  dataframe can be in any order as long as they are named properly.
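
    The name-based selection described above can be sketched as follows. This is a minimal
    illustration, not the library's implementation: ``find_time_column`` and its parameters are
    hypothetical names, and the error message is invented for the example.

    .. code-block:: python

        def find_time_column(columns, instance_id_name, time_index_name):
            """Return the name of the time column in a cutoff time dataframe.

            Hypothetical helper: selection is by name only, never by position.
            """
            candidates = [c for c in columns if c != instance_id_name]
            for name in (time_index_name, "time"):
                if name in candidates:
                    return name
            raise AttributeError(
                "Cutoff time dataframe must contain a column named "
                f"'{time_index_name}' or 'time'"
            )


        # Column order does not matter as long as the names are right.
        print(find_time_column(["time", "customer_id"], "customer_id", "join_date"))

        # A time column with an unrecognized name is rejected.
        try:
            find_time_column(["customer_id", "when"], "customer_id", "join_date")
        except AttributeError as err:
            print(err)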

* The ``type_string`` attributes of all ``Variable`` subclasses are now a snake case conversion of their class names. This
  changes the ``type_string`` of the ``Unknown``, ``IPAddress``, ``EmailAddress``, ``SubRegionCode``, ``FilePath``, ``LatLong``, and ``ZIPCode`` classes.
  Old saved entitysets that used these variables may load incorrectly.
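
    As an illustration of what a snake case conversion of a class name looks like, here is a
    minimal sketch using a common regular-expression approach. The ``camel_to_snake`` helper
    below is written for this example and is not necessarily the function Featuretools uses
    internally.

    .. code-block:: python

        import re


        def camel_to_snake(name):
            """Convert a CamelCase class name to snake_case (illustrative only)."""
            # Split before a capitalized word: "IPAddress" -> "IP_Address".
            name = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
            # Split a lowercase letter or digit from a following capital letter.
            return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()


        class_names = ["Unknown", "IPAddress", "EmailAddress", "SubRegionCode", "FilePath", "LatLong"]
        for class_name in class_names:
            print(class_name, "->", camel_to_snake(class_name))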

v0.14.0 Apr 30, 2020
====================
    * Enhancements
        * ft.encode_features - use less memory for one-hot encoded columns (:pr:`876`)
    * Fixes
        * Use logger.warning to fix deprecated logger.warn (:pr:`871`)
        * Add dtype to interesting_values to fix deprecated empty Series with no dtype (:pr:`933`)
        * Remove overlap in training windows (:pr:`930`)
        * Fix progress bar in notebook (:pr:`932`)
    * Changes
        * Change premium primitives CI test to Python 3.6 (:pr:`916`)
        * Remove Python 3.5 support (:pr:`917`)
    * Documentation Changes
        * Fix README links to docs (:pr:`872`)
        * Fix Github links with correct organizations (:pr:`908`)
        * Fix hyperlinks in docs and docstrings with updated address (:pr:`910`)
        * Remove unused script for uploading docs to AWS (:pr:`911`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`

Breaking Changes
++++++++++++++++

* Using training windows in feature calculations can result in different values than previous versions.
  This was done to prevent consecutive training windows from overlapping by excluding data at the oldest point in time.
  For example, if we use a cutoff time at the first minute of the hour with a one hour training window,
  the first minute of the previous hour will no longer be included in the feature calculation.
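
    The window boundaries can be sketched in plain Python as below. The helper name
    ``in_training_window`` is hypothetical, and treating data exactly at the cutoff time as
    included is an assumption of this sketch.

    .. code-block:: python

        from datetime import datetime, timedelta


        def in_training_window(timestamp, cutoff_time, training_window):
            """Return True if ``timestamp`` falls in the window ending at ``cutoff_time``.

            The oldest point in time (``cutoff_time - training_window``) is excluded.
            Including data exactly at the cutoff time is an assumption of this sketch.
            """
            window_start = cutoff_time - training_window
            return window_start < timestamp <= cutoff_time


        cutoff = datetime(2020, 4, 30, 10, 0)
        window = timedelta(hours=1)
        observations = [
            datetime(2020, 4, 30, 9, 0),  # oldest point of the window: excluded
            datetime(2020, 4, 30, 9, 30),
            datetime(2020, 4, 30, 10, 0),
        ]
        used = [ts for ts in observations if in_training_window(ts, cutoff, window)]
        print(used)

    With these boundaries, the 09:00 observation belongs only to the window ending at 09:00,
    so consecutive one hour windows do not share any data.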

v0.13.4 Mar 27, 2020
====================
    .. warning::
        The next non-bugfix release of Featuretools will not support Python 3.5

    * Fixes
        * Fix ft.show_info() not displaying in Jupyter notebooks (:pr:`863`)
    * Changes
        * Added Plugin Warnings at Entry Point (:pr:`850`, :pr:`869`)
    * Documentation Changes
        * Add links to primitives.featurelabs.com (:pr:`860`)
        * Add source code links to API reference (:pr:`862`)
        * Update links for testing Dask/Spark integrations (:pr:`867`)
        * Update release documentation for featuretools (:pr:`868`)
    * Testing Changes
        * Miscellaneous changes (:pr:`861`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`FreshLeaf8865`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd`

v0.13.3 Feb 28, 2020
====================
    * Fixes
        * Fix a connection closed error when using n_jobs (:pr:`853`)
    * Changes
        * Pin msgpack dependency for Python 3.5; remove dataframe from Dask dependency (:pr:`851`)
    * Documentation Changes
        * Update link to help documentation page in Github issue template (:pr:`855`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`rwedge`

v0.13.2 Jan 31, 2020
====================
    * Enhancements
        * Support for Pandas 1.0.0 (:pr:`844`)
    * Changes
        * Remove dependency on s3fs library for anonymous downloads from S3 (:pr:`825`)
    * Testing Changes
        * Added GitHub Action to automatically run performance tests (:pr:`840`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`rwedge`

v0.13.1 Dec 28, 2019
====================
    * Fixes
        * Raise error when given wrong input for ignore_variables (:pr:`826`)
        * Fix multi-output features not created when there is no child data (:pr:`834`)
        * Removing type casting in Equals and NotEquals primitives (:pr:`504`)
    * Changes
        * Replace pd.timedelta time units that were deprecated (:pr:`822`)
        * Move sklearn wrapper to separate library (:pr:`835`, :pr:`837`)
    * Testing Changes
        * Run unit tests in windows environment (:pr:`790`)
        * Update boto3 version requirement for tests (:pr:`838`)

    Thanks to the following people for contributing to this release:
    :user:`jeffzi`, :user:`kmax12`, :user:`rwedge`, :user:`systemshift`

v0.13.0 Nov 30, 2019
====================
    * Enhancements
        * Added GitHub Action to auto upload releases to PyPI (:pr:`816`)
    * Fixes
        * Fix issue where some primitive options would not be applied (:pr:`807`)
        * Fix issue with converting to pickle or parquet after adding interesting features (:pr:`798`, :pr:`823`)
        * Diff primitive now calculates using all available data (:pr:`824`)
        * Prevent DFS from creating Identity Features of globally ignored variables (:pr:`819`)
    * Changes
        * Remove python 2.7 support from serialize.py (:pr:`812`)
        * Make smart_open, boto3, and s3fs optional dependencies (:pr:`827`)
    * Documentation Changes
        * Remove Python 2.7 support and add 3.7 in install.rst (:pr:`805`)
        * Fix import error in docs (:pr:`803`)
        * Fix release title formatting in changelog (:pr:`806`)
    * Testing Changes
        * Use multiple CPUs to run tests on CI (:pr:`811`)
        * Refactor test entityset creation to avoid saving to disk (:pr:`813`, :pr:`821`)
        * Remove get_values() from test_es.py to remove warnings (:pr:`820`)

    Thanks to the following people for contributing to this release:
    :user:`frances-h`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`systemshift`

Breaking Changes
++++++++++++++++

* The libraries used for downloading or uploading from S3 or URLs are now
  optional and will no longer be installed by default.  To use this
  functionality they will need to be installed separately.
* The fix to how the Diff primitive is calculated may slow down the overall
  calculation time of feature lists that use this primitive.

v0.12.0 Oct 31, 2019
====================
    * Enhancements
        * Added First primitive (:pr:`770`)
        * Added Entropy aggregation primitive (:pr:`779`)
        * Allow custom naming for multi-output primitives (:pr:`780`)
    * Fixes
        * Prevents user from removing base entity time index using additional_variables (:pr:`768`)
        * Fixes error when a multioutput primitive was supplied to dfs as a groupby trans primitive (:pr:`786`)
    * Changes
        * Drop Python 2 support (:pr:`759`)
        * Add unit parameter to AvgTimeBetween (:pr:`771`)
        * Require Pandas 0.24.1 or higher (:pr:`787`)
    * Documentation Changes
        * Update featuretools slack link (:pr:`765`)
        * Set up repo to use Read the Docs (:pr:`776`)
        * Add First primitive to API reference docs (:pr:`782`)
    * Testing Changes
        * CircleCI fixes (:pr:`774`)
        * Disable PIP progress bars (:pr:`775`)

    Thanks to the following people for contributing to this release:
    :user:`ablacke-ayx`, :user:`BoopBoopBeepBoop`, :user:`jeffzi`,
    :user:`kmax12`, :user:`rwedge`, :user:`thehomebrewnerd`, :user:`twdobson`

v0.11.0 Sep 30, 2019
====================
    .. warning::
        The next non-bugfix release of Featuretools will not support Python 2

    * Enhancements
        * Improve how files are copied and written (:pr:`721`)
        * Add number of rows to graph in entityset.plot (:pr:`727`)
        * Added support for pandas DateOffsets in DFS and Timedelta (:pr:`732`)
        * Enable feature-specific top_n value using a dictionary in encode_features (:pr:`735`)
        * Added progress_callback parameter to dfs() and calculate_feature_matrix() (:pr:`739`, :pr:`745`)
        * Enable specifying primitives on a per column or per entity basis (:pr:`748`)
    * Fixes
        * Fixed entity set deserialization (:pr:`720`)
        * Added error message when DateTimeIndex is a variable but not set as the time_index (:pr:`723`)
        * Fixed CumCount and other group-by transform primitives that take ID as input (:pr:`733`, :pr:`754`)
        * Fix progress bar undercounting (:pr:`743`)
        * Updated training_window error assertion to only check against observations (:pr:`728`)
        * Don't delete the whole destination folder while saving entityset (:pr:`717`)
    * Changes
        * Raise warning and not error on schema version mismatch (:pr:`718`)
        * Change feature calculation to return in order of instance ids provided (:pr:`676`)
        * Removed time remaining from displayed progress bar in dfs() and calculate_feature_matrix() (:pr:`739`)
        * Raise warning in normalize_entity() when time_index of base_entity has an invalid type (:pr:`749`)
        * Remove toolz as a direct dependency (:pr:`755`)
        * Allow boolean variable types to be used in the Multiply primitive (:pr:`756`)
    * Documentation Changes
        * Updated URL for Compose (:pr:`716`)
    * Testing Changes
        * Update dependencies (:pr:`738`, :pr:`741`, :pr:`747`)

    Thanks to the following people for contributing to this release:
    :user:`angela97lin`, :user:`chidauri`, :user:`christopherbunn`,
    :user:`frances-h`, :user:`jeff-hernandez`, :user:`kmax12`,
    :user:`MarcoGorelli`, :user:`rwedge`, :user:`thehomebrewnerd`

Breaking Changes
++++++++++++++++

* Feature calculations will return in the order of instance ids provided instead of the order of time points instances are calculated at.
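
    A minimal sketch of the new ordering, using plain tuples instead of a feature matrix.
    The ``order_by_instance_ids`` helper is hypothetical and assumes each instance id appears
    once.

    .. code-block:: python

        def order_by_instance_ids(rows, instance_ids):
            """Return calculated rows in the order the instance ids were provided.

            ``rows`` is a list of (instance_id, time, value) tuples in the order
            they were calculated (assumed here to be ordered by time).
            """
            by_id = {row[0]: row for row in rows}
            return [by_id[instance_id] for instance_id in instance_ids]


        calculated = [(3, "2019-01-01", 0.5), (1, "2019-02-01", 1.5), (2, "2019-03-01", 2.5)]
        ordered = order_by_instance_ids(calculated, [1, 2, 3])
        print(ordered)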

v0.10.1 Aug 25, 2019
====================
    * Fixes
        * Fix serialized LatLong data being loaded as strings (:pr:`712`)
    * Documentation Changes
        * Fixed FAQ cell output (:pr:`710`)

    Thanks to the following people for contributing to this release:
    :user:`gsheni`, :user:`rwedge`


v0.10.0 Aug 19, 2019
====================
    .. warning::
        The next non-bugfix release of Featuretools will not support Python 2


    * Enhancements
        * Give more frequent progress bar updates and update chunk size behavior (:pr:`631`, :pr:`696`)
        * Added drop_first as param in encode_features (:pr:`647`)
        * Added support for stacking multi-output primitives (:pr:`679`)
        * Generate transform features of direct features (:pr:`623`)
        * Added serializing and deserializing from S3 and deserializing from URLs (:pr:`685`)
        * Added nlp_primitives as an add-on library (:pr:`704`)
        * Added AutoNormalize to Featuretools plugins (:pr:`699`)
        * Added functionality for relative units (month/year) in Timedelta (:pr:`692`)
        * Added categorical-encoding as an add-on library (:pr:`700`)
    * Fixes
        * Fix performance regression in DFS (:pr:`637`)
        * Fix deserialization of feature relationship path (:pr:`665`)
        * Set index after adding ancestor relationship variables (:pr:`668`)
        * Fix user-supplied variable_types modification in Entity init (:pr:`675`)
        * Don't calculate dependencies of unnecessary features (:pr:`667`)
        * Prevent normalize entity's new entity having same index as base entity (:pr:`681`)
        * Update variable type inference to better check for string values (:pr:`683`)
    * Changes
        * Moved dask, distributed imports (:pr:`634`)
    * Documentation Changes
        * Miscellaneous changes (:pr:`641`, :pr:`658`)
        * Modified doc_string of top_n in encoding (:pr:`648`)
        * Hyperlinked ComposeML (:pr:`653`)
        * Added FAQ (:pr:`620`, :pr:`677`)
        * Fixed FAQ question with multiple question marks (:pr:`673`)
    * Testing Changes
        * Add master, and release tests for premium primitives (:pr:`660`, :pr:`669`)
        * Miscellaneous changes (:pr:`672`, :pr:`674`)

    Thanks to the following people for contributing to this release:
    :user:`alexjwang`, :user:`allisonportis`, :user:`ayushpatidar`,
    :user:`CJStadler`, :user:`ctduffy`, :user:`gsheni`, :user:`jeff-hernandez`,
    :user:`jeremyliweishih`, :user:`kmax12`, :user:`rwedge`, :user:`zhxt95`

v0.9.1 Jul 3, 2019
==================
    * Enhancements
        * Speedup groupby transform calculations (:pr:`609`)
        * Generate features along all paths when there are multiple paths between entities (:pr:`600`, :pr:`608`)
    * Fixes
        * Select columns of dataframe using a list (:pr:`615`)
        * Change type of features calculated on Index features to Categorical (:pr:`602`)
        * Filter dataframes through forward relationships (:pr:`625`)
        * Specify Dask version in requirements for python 2 (:pr:`627`)
        * Keep dataframe sorted by time during feature calculation (:pr:`626`)
        * Fix bug in encode_features that created duplicate columns of
          features with multiple outputs (:pr:`622`)
    * Changes
        * Remove unused variance_selection.py file (:pr:`613`)
        * Remove Timedelta data param (:pr:`619`)
        * Remove DaysSince primitive (:pr:`628`)
    * Documentation Changes
        * Add installation instructions for add-on libraries (:pr:`617`)
        * Clarification of Multi Output Feature Creation (:pr:`638`)
        * Miscellaneous changes (:pr:`632`, :pr:`639`)
    * Testing Changes
        * Miscellaneous changes (:pr:`595`, :pr:`612`)

    Thanks to the following people for contributing to this release:
    :user:`CJStadler`, :user:`kmax12`, :user:`rwedge`, :user:`gsheni`, :user:`kkleidal`, :user:`ctduffy`

v0.9.0 Jun 19, 2019
===================
    * Enhancements
        * Add unit parameter to timesince primitives (:pr:`558`)
        * Add ability to install optional add on libraries (:pr:`551`)
        * Load and save features from open files and strings (:pr:`566`)
        * Support custom variable types (:pr:`571`)
        * Support entitysets which have multiple paths between two entities (:pr:`572`, :pr:`544`)
        * Added show_info function, more output information added to CLI ``featuretools info`` (:pr:`525`)
    * Fixes
        * Normalize_entity specifies error when 'make_time_index' is an invalid string (:pr:`550`)
        * Schema version added for entityset serialization (:pr:`586`)
        * Renamed features have names correctly serialized (:pr:`585`)
        * Improved error message for index/time_index being the same column in normalize_entity and entity_from_dataframe (:pr:`583`)
        * Removed all mentions of allow_where (:pr:`587`, :pr:`588`)
        * Removed unused variable in normalize entity (:pr:`589`)
        * Change time since return type to numeric (:pr:`606`)
    * Changes
        * Refactor get_pandas_data_slice to take single entity (:pr:`547`)
        * Updates TimeSincePrevious and Diff Primitives (:pr:`561`)
        * Remove unnecessary time_last variable (:pr:`546`)
    * Documentation Changes
        * Add Featuretools Enterprise to documentation (:pr:`563`)
        * Miscellaneous changes (:pr:`552`, :pr:`573`, :pr:`577`, :pr:`599`)
    * Testing Changes
        * Miscellaneous changes (:pr:`559`, :pr:`569`, :pr:`570`, :pr:`574`, :pr:`584`, :pr:`590`)

    Thanks to the following people for contributing to this release:
    :user:`alexjwang`, :user:`allisonportis`, :user:`CJStadler`, :user:`ctduffy`, :user:`gsheni`, :user:`kmax12`, :user:`rwedge`

v0.8.0 May 17, 2019
===================
    * Rename NUnique to NumUnique (:pr:`510`)
    * Serialize features as JSON (:pr:`532`)
    * Drop all variables at once in normalize_entity (:pr:`533`)
    * Remove unnecessary sorting from normalize_entity (:pr:`535`)
    * Features cache their names (:pr:`536`)
    * Only calculate features for instances before cutoff (:pr:`523`)
    * Remove all relative imports (:pr:`530`)
    * Added FullName Variable Type (:pr:`506`)
    * Add error message when target entity does not exist (:pr:`520`)
    * New demo links (:pr:`542`)
    * Remove duplicate features check in DFS (:pr:`538`)
    * featuretools_primitives entry point expects list of primitive classes (:pr:`529`)
    * Update ALL_VARIABLE_TYPES list (:pr:`526`)
    * More Informative N Jobs Prints and Warnings (:pr:`511`)
    * Update sklearn version requirements (:pr:`541`)
    * Update Makefile (:pr:`519`)
    * Remove unused parameter in Entity._handle_time (:pr:`524`)
    * Remove build_ext code from setup.py (:pr:`513`)
    * Documentation updates (:pr:`512`, :pr:`514`, :pr:`515`, :pr:`521`, :pr:`522`, :pr:`527`, :pr:`545`)
    * Testing updates (:pr:`509`, :pr:`516`, :pr:`517`, :pr:`539`)

    Thanks to the following people for contributing to this release: :user:`bphi`, :user:`CharlesBradshaw`, :user:`CJStadler`, :user:`glentennis`, :user:`gsheni`, :user:`kmax12`, :user:`rwedge`

Breaking Changes
++++++++++++++++

* ``NUnique`` has been renamed to ``NumUnique``.

    Previous behavior

    .. code-block:: python

        from featuretools.primitives import NUnique

    New behavior

    .. code-block:: python

        from featuretools.primitives import NumUnique

v0.7.1 Apr 24, 2019
===================
    * Automatically generate feature name for controllable primitives (:pr:`481`)
    * Primitive docstring updates (:pr:`489`, :pr:`492`, :pr:`494`, :pr:`495`)
    * Change primitive functions that returned strings to return functions (:pr:`499`)
    * CLI customizable via entrypoints (:pr:`493`)
    * Improve calculation of aggregation features on grandchildren (:pr:`479`)
    * Refactor entrypoints to use decorator (:pr:`483`)
    * Include doctests in testing suite (:pr:`491`)
    * Documentation updates (:pr:`490`)
    * Update how standard primitives are imported internally (:pr:`482`)

    Thanks to the following people for contributing to this release: :user:`bukosabino`, :user:`CharlesBradshaw`, :user:`glentennis`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kmax12`, :user:`minkvsky`, :user:`rwedge`, :user:`thehomebrewnerd`

v0.7.0 Mar 29, 2019
===================
    * Improve Entity Set Serialization (:pr:`361`)
    * Support calling a primitive instance's function directly (:pr:`461`, :pr:`468`)
    * Support other libraries extending featuretools functionality via entrypoints (:pr:`452`)
    * Remove featuretools install command (:pr:`475`)
    * Add GroupByTransformFeature (:pr:`455`, :pr:`472`, :pr:`476`)
    * Update Haversine Primitive (:pr:`435`, :pr:`462`)
    * Add commutative argument to SubtractNumeric and DivideNumeric primitives (:pr:`457`)
    * Add FilePath variable_type (:pr:`470`)
    * Add PhoneNumber, DateOfBirth, URL variable types (:pr:`447`)
    * Generalize infer_variable_type, convert_variable_data and convert_all_variable_data methods (:pr:`423`)
    * Documentation updates (:pr:`438`, :pr:`446`, :pr:`458`, :pr:`469`)
    * Testing updates (:pr:`440`, :pr:`444`, :pr:`445`, :pr:`459`)

    Thanks to the following people for contributing to this release: :user:`bukosabino`, :user:`CharlesBradshaw`, :user:`ColCarroll`, :user:`glentennis`, :user:`grayskripko`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`jrkinley`, :user:`kmax12`, :user:`RogerTangos`, :user:`rwedge`

Breaking Changes
++++++++++++++++

* ``ft.dfs`` now has a ``groupby_trans_primitives`` parameter that DFS uses to automatically construct features that group by an ID column and then apply a transform primitive to each group. This change applies to the following primitives: ``CumSum``, ``CumCount``, ``CumMean``, ``CumMin``, and ``CumMax``.

    Previous behavior

    .. code-block:: python

        ft.dfs(entityset=es,
               target_entity='customers',
               trans_primitives=["cum_mean"])

    New behavior

    .. code-block:: python

        ft.dfs(entityset=es,
               target_entity='customers',
               groupby_trans_primitives=["cum_mean"])
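
    Conceptually, a group-by transform computes the transform separately within each group
    while keeping the original row order. The sketch below shows this for a cumulative mean
    in plain Python; ``groupby_cum_mean`` is a hypothetical helper written for this example,
    not part of the Featuretools API.

    .. code-block:: python

        from collections import defaultdict


        def groupby_cum_mean(values, group_ids):
            """Cumulative mean of ``values``, computed separately for each group id."""
            totals = defaultdict(float)
            counts = defaultdict(int)
            result = []
            for value, group_id in zip(values, group_ids):
                totals[group_id] += value
                counts[group_id] += 1
                result.append(totals[group_id] / counts[group_id])
            return result


        amounts = [10.0, 20.0, 30.0, 40.0]
        customer_ids = ["a", "b", "a", "b"]
        print(groupby_cum_mean(amounts, customer_ids))  # [10.0, 20.0, 20.0, 30.0]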

* Related to the above change, cumulative transform features are now defined using a new feature class, ``GroupByTransformFeature``.

    Previous behavior

    .. code-block:: python

        ft.Feature([base_feature, groupby_feature], primitive=CumulativePrimitive)


    New behavior

    .. code-block:: python

        ft.Feature(base_feature, groupby=groupby_feature, primitive=CumulativePrimitive)


v0.6.1 Feb 15, 2019
===================
    * Cumulative primitives (:pr:`410`)
    * Entity.query_by_values now preserves row order of underlying data (:pr:`428`)
    * Implementing Country Code and Sub Region Codes as variable types (:pr:`430`)
    * Added IPAddress and EmailAddress variable types (:pr:`426`)
    * Install data and dependencies (:pr:`403`)
    * Add TimeSinceFirst, fix TimeSinceLast (:pr:`388`)
    * Allow user to pass in desired feature return types (:pr:`372`)
    * Add new configuration object (:pr:`401`)
    * Replace NUnique get_function (:pr:`434`)
    * _calculate_identity_features now only returns the features asked for, instead of the entire entity (:pr:`429`)
    * Primitive function name uniqueness (:pr:`424`)
    * Update NumCharacters and NumWords primitives (:pr:`419`)
    * Removed Variable.dtype (:pr:`416`, :pr:`433`)
    * Change to zipcode rep, str for pandas (:pr:`418`)
    * Remove pandas version upper bound (:pr:`408`)
    * Make S3 dependencies optional (:pr:`404`)
    * Check that agg_primitives and trans_primitives are right primitive type (:pr:`397`)
    * Mean primitive changes (:pr:`395`)
    * Fix transform stacking on multi-output aggregation (:pr:`394`)
    * Fix list_primitives (:pr:`391`)
    * Handle graphviz dependency (:pr:`389`, :pr:`396`, :pr:`398`)
    * Testing updates (:pr:`402`, :pr:`417`, :pr:`433`)
    * Documentation updates (:pr:`400`, :pr:`409`, :pr:`415`, :pr:`417`, :pr:`420`, :pr:`421`, :pr:`422`, :pr:`431`)


    Thanks to the following people for contributing to this release: :user:`CharlesBradshaw`, :user:`csala`, :user:`floscha`, :user:`gsheni`, :user:`jxwolstenholme`, :user:`kmax12`, :user:`RogerTangos`, :user:`rwedge`

v0.6.0 Jan 30, 2019
===================
    * Primitive refactor (:pr:`364`)
    * Mean ignore NaNs (:pr:`379`)
    * Plotting entitysets (:pr:`382`)
    * Add seed features later in DFS process (:pr:`357`)
    * Multiple output column features (:pr:`376`)
    * Add ZipCode Variable Type (:pr:`367`)
    * Add ``primitive.get_filepath`` and example of primitive loading data from external files (:pr:`380`)
    * Transform primitives take series as input (:pr:`385`)
    * Update dependency requirements (:pr:`378`, :pr:`383`, :pr:`386`)
    * Add modulo to override tests (:pr:`384`)
    * Update documentation (:pr:`368`, :pr:`377`)
    * Update README.md (:pr:`366`, :pr:`373`)
    * Update CI tests (:pr:`359`, :pr:`360`, :pr:`375`)

    Thanks to the following people for contributing to this release: :user:`floscha`, :user:`gsheni`, :user:`kmax12`, :user:`RogerTangos`, :user:`rwedge`

v0.5.1 Dec 17, 2018
===================
    * Add missing dependencies (:pr:`353`)
    * Move comment to note in documentation (:pr:`352`)

v0.5.0 Dec 17, 2018
===================
    * Add specific error for duplicate additional/copy_variables in normalize_entity (:pr:`348`)
    * Removed EntitySet._import_from_dataframe (:pr:`346`)
    * Removed time_index_reduce parameter (:pr:`344`)
    * Allow installation of additional primitives (:pr:`326`)
    * Fix DatetimeIndex variable conversion (:pr:`342`)
    * Update Sklearn DFS Transformer (:pr:`343`)
    * Clean up entity creation logic (:pr:`336`)
    * Remove casting to list in transform feature calculation (:pr:`330`)
    * Fix sklearn wrapper (:pr:`335`)
    * Add readme to pypi
    * Update conda docs after move to conda-forge (:pr:`334`)
    * Add wrapper for scikit-learn Pipelines (:pr:`323`)
    * Remove parse_date_cols parameter from EntitySet._import_from_dataframe (:pr:`333`)

    Thanks to the following people for contributing to this release: :user:`bukosabino`, :user:`georgewambold`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kmax12`, and :user:`rwedge`.

v0.4.1 Nov 29, 2018
===================
    * Resolve bug preventing using first column as index by default (:pr:`308`)
    * Handle return type when creating features from Id variables (:pr:`318`)
    * Make id an optional parameter of EntitySet constructor (:pr:`324`)
    * Handle primitives with same function being applied to same column (:pr:`321`)
    * Update requirements (:pr:`328`)
    * Clean up DFS arguments (:pr:`319`)
    * Clean up Pandas Backend (:pr:`302`)
    * Update properties of cumulative transform primitives (:pr:`320`)
    * Feature stability between versions documentation (:pr:`316`)
    * Add download count to GitHub readme (:pr:`310`)
    * Fixed #297 update tests to check error strings (:pr:`303`)
    * Remove usage of fixtures in agg primitive tests (:pr:`325`)

v0.4.0 Oct 31, 2018
===================
    * Remove ft.utils.gen_utils.getsize and make pympler a test requirement (:pr:`299`)
    * Update requirements.txt (:pr:`298`)
    * Refactor EntitySet.find_path(...) (:pr:`295`)
    * Clean up unused methods (:pr:`293`)
    * Remove unused parents property of Entity (:pr:`283`)
    * Removed relationships parameter (:pr:`284`)
    * Improve time index validation (:pr:`285`)
    * Encode features with "unknown" class in categorical (:pr:`287`)
    * Allow where clauses on direct features in Deep Feature Synthesis (:pr:`279`)
    * Change to fullargsspec (:pr:`288`)
    * Parallel verbose fixes (:pr:`282`)
    * Update tests for python 3.7 (:pr:`277`)
    * Check duplicate rows cutoff times (:pr:`276`)
    * Load retail demo data using compressed file (:pr:`271`)

v0.3.1 Sep 28, 2018
===================
    * Handling time rewrite (:pr:`245`)
    * Update deep_feature_synthesis.py (:pr:`249`)
    * Handling return type when creating features from DatetimeTimeIndex (:pr:`266`)
    * Update retail.py (:pr:`259`)
    * Improve Consistency of Transform Primitives (:pr:`236`)
    * Update demo docstrings (:pr:`268`)
    * Handle non-string column names (:pr:`255`)
    * Clean up merging of aggregation primitives (:pr:`250`)
    * Add tests for Entity methods (:pr:`262`)
    * Handle no child data when calculating aggregation features with multiple arguments (:pr:`264`)
    * Add ``is_string`` utils function (:pr:`260`)
    * Update python versions to match docker container (:pr:`261`)
    * Handle where clause when no child data (:pr:`258`)
    * No longer cache demo csvs, remove config file (:pr:`257`)
    * Avoid stacking "expanding" primitives (:pr:`238`)
    * Use randomly generated names in retail csv (:pr:`233`)
    * Update README.md (:pr:`243`)

v0.3.0 Aug 27, 2018
===================
    * Improve performance of all feature calculations (:pr:`224`)
    * Update agg primitives to use more efficient functions (:pr:`215`)
    * Optimize metadata calculation (:pr:`229`)
    * More robust handling when no data at a cutoff time (:pr:`234`)
    * Workaround categorical merge (:pr:`231`)
    * Switch which CSV is associated with which variable (:pr:`228`)
    * Remove unused kwargs from query_by_values, filter_and_sort (:pr:`225`)
    * Remove convert_links_to_integers (:pr:`219`)
    * Add conda install instructions (:pr:`223`, :pr:`227`)
    * Add example of using Dask to parallelize to docs  (:pr:`221`)

v0.2.2 Aug 20, 2018
===================
    * Remove unnecessary check no related instances call and refactor (:pr:`209`)
    * Improve memory usage through support for pandas categorical types (:pr:`196`)
    * Bump minimum pandas version from 0.20.3 to 0.23.0 (:pr:`216`)
    * Better parallel memory warnings (:pr:`208`, :pr:`214`)
    * Update demo datasets (:pr:`187`, :pr:`201`, :pr:`207`)
    * Make primitive lookup case insensitive  (:pr:`213`)
    * Use capital name (:pr:`211`)
    * Set class name for Min (:pr:`206`)
    * Remove ``variable_types`` from normalize entity (:pr:`205`)
    * Handle parquet serialization with last time index (:pr:`204`)
    * Reset index of cutoff times in calculate feature matrix (:pr:`198`)
    * Check argument types for .normalize_entity (:pr:`195`)
    * Type checking ignore entities.  (:pr:`193`)

v0.2.1 Jul 2, 2018
==================
    * Cpu count fix (:pr:`176`)
    * Update flight (:pr:`175`)
    * Move feature matrix calculation helper functions to separate file (:pr:`177`)

v0.2.0 Jun 22, 2018
===================
    * Multiprocessing (:pr:`170`)
    * Handle unicode encoding in repr throughout Featuretools (:pr:`161`)
    * Clean up EntitySet class (:pr:`145`)
    * Add support for building and uploading conda package (:pr:`167`)
    * Parquet serialization (:pr:`152`)
    * Remove variable stats (:pr:`171`)
    * Make sure index variable comes first (:pr:`168`)
    * No last time index update on normalize (:pr:`169`)
    * Remove list of times as an option for ``cutoff_time`` in ``calculate_feature_matrix`` (:pr:`165`)
    * Config does error checking to see if it can write to disk (:pr:`162`)


v0.1.21 May 30, 2018
====================
    * Support Pandas 0.23.0 (:pr:`153`, :pr:`154`, :pr:`155`, :pr:`159`)
    * No EntitySet required in loading/saving features (:pr:`141`)
    * Use s3 demo csv with better column names (:pr:`139`)
    * More reasonable start parameter (:pr:`149`)
    * Add issue template (:pr:`133`)
    * Improve tests (:pr:`136`, :pr:`137`, :pr:`144`, :pr:`147`)
    * Remove unused functions (:pr:`140`, :pr:`143`, :pr:`146`)
    * Update documentation after recent changes / removals (:pr:`157`)
    * Rename demo retail csv file (:pr:`148`)
    * Add names for binary (:pr:`142`)
    * EntitySet repr to use get_name rather than id (:pr:`134`)
    * Ensure config dir is writable (:pr:`135`)

v0.1.20 Apr 13, 2018
====================
    * Primitives as strings in DFS parameters (:pr:`129`)
    * Integer time index bugfixes (:pr:`128`)
    * Add make_temporal_cutoffs utility function (:pr:`126`)
    * Show all entities, switch shape display to row/col (:pr:`124`)
    * Improved chunking when calculating feature matrices  (:pr:`121`)
    * fixed num characters nan fix (:pr:`118`)
    * modify ignore_variables docstring (:pr:`117`)

v0.1.19 Mar 21, 2018
====================
    * More descriptive DFS progress bar (:pr:`69`)
    * Convert text variable to string before NumWords (:pr:`106`)
    * EntitySet.concat() reindexes relationships (:pr:`96`)
    * Keep non-feature columns when encoding feature matrix (:pr:`111`)
    * Uses full entity update for dependencies of uses_full_entity features (:pr:`110`)
    * Update column names in retail demo (:pr:`104`)
    * Handle Transform features that need access to all values of entity (:pr:`91`)

v0.1.18 Feb 27, 2018
====================
    * fixes related instances bug (:pr:`97`)
    * Adding non-feature columns to calculated feature matrix (:pr:`78`)
    * Relax numpy version req (:pr:`82`)
    * Remove `entity_from_csv`, tests, and lint (:pr:`71`)

v0.1.17 Jan 18, 2018
====================
    * LatLong type (:pr:`57`)
    * Last time index fixes (:pr:`70`)
    * Make median agg primitives ignore nans by default (:pr:`61`)
    * Remove Python 3.4 support (:pr:`64`)
    * Change `normalize_entity` to update `secondary_time_index` (:pr:`59`)
    * Unpin requirements (:pr:`53`)
    * associative -> commutative (:pr:`56`)
    * Add Words and Chars primitives (:pr:`51`)

v0.1.16 Dec 19, 2017
====================
    * fix EntitySet.combine_variables and standardize encode_features (:pr:`47`)
    * Python 3 compatibility (:pr:`16`)

v0.1.15 Dec 18, 2017
====================
    * Fix variable type in demo data (:pr:`37`)
    * Custom primitive kwarg fix (:pr:`38`)
    * Changed order and text of arguments in make_trans_primitive docstring (:pr:`42`)

v0.1.14 Nov 20, 2017
====================
    * Last time index (:pr:`33`)
    * Update Scipy version to 1.0.0 (:pr:`31`)


v0.1.13 Nov 1, 2017
===================
    * Add MANIFEST.in (:pr:`26`)

v0.1.11 Oct 31, 2017
====================
    * Package linting (:pr:`7`)
    * Custom primitive creation functions (:pr:`13`)
    * Split requirements to separate files and pin to latest versions (:pr:`15`)
    * Select low information features (:pr:`18`)
    * Fix docs typos (:pr:`19`)
    * Fixed Diff primitive for rare nan case (:pr:`21`)
    * added some missing doc strings (:pr:`23`)
    * Trend fix (:pr:`22`)
    * Remove as_dir=False option from EntitySet.to_pickle() (:pr:`20`)
    * Entity Normalization Preserves Types of Copy & Additional Variables (:pr:`25`)

v0.1.10 Oct 12, 2017
====================
    * NumTrue primitive added and docstring of other primitives updated (:pr:`11`)
    * fixed hash issue with same base features (:pr:`8`)
    * Head fix (:pr:`9`)
    * Fix training window (:pr:`10`)
    * Add associative attribute to primitives (:pr:`3`)
    * Add status badges, fix license in setup.py (:pr:`1`)
    * fixed head printout and flight demo index (:pr:`2`)

v0.1.9 Sep 8, 2017
==================
    * Documentation improvements
    * New ``featuretools.demo.load_mock_customer`` function

v0.1.8 Sep 1, 2017
==================
    * Bug fixes
    * Added ``Percentile`` transform primitive

v0.1.7 Aug 17, 2017
===================
    * Performance improvements for approximate in ``calculate_feature_matrix`` and ``dfs``
    * Added ``Week`` transform primitive

v0.1.6 Jul 26, 2017
===================
    * Added ``load_features`` and ``save_features`` to persist and reload features
    * Added save_progress argument to ``calculate_feature_matrix``
    * Added approximate parameter to ``calculate_feature_matrix`` and ``dfs``
    * Added ``load_flight`` to ft.demo

v0.1.5 Jul 11, 2017
===================
    * Windows support

v0.1.3 Jul 10, 2017
===================
    * Renamed feature submodule to primitives
    * Renamed prediction_entity arguments to target_entity
    * Added training_window parameter to ``calculate_feature_matrix``

v0.1.2 Jul 3, 2017
==================
    * Initial release

.. command
.. git log --pretty=oneline --abbrev-commit


================================================
FILE: docs/source/resources/ecosystem.rst
================================================
:description: A list of libraries, use cases / demos, and tutorials that leverage Featuretools

===============================
Featuretools External Ecosystem
===============================

New projects are regularly being built on top of Featuretools, highlighting the importance of automated feature engineering. On this page, we have a list of libraries, use cases / demos, and tutorials that leverage Featuretools. If you would like to add a project, please contact us or submit a pull request on `GitHub`_.

.. _`GitHub`: https://github.com/alteryx/featuretools

.. note::

    We are proud and excited to share the work of people using Featuretools, but we cannot endorse or provide support for the tools on this page.

---------
Libraries
---------


`MLBlocks`_
===========
- MLBlocks is a simple framework for composing end-to-end tunable Machine Learning Pipelines by seamlessly combining tools from any python library with a simple, common and uniform interface. MLBlocks contains a primitive that uses Featuretools.

.. _`MLBlocks`: https://github.com/HDI-Project/MLBlocks

`Cardea`_
=========
- Cardea is a machine learning library built on top of the FHIR data schema. It uses a number of **automl** tools, including Featuretools.

.. _`Cardea`: https://github.com/D3-AI/Cardea

-----------------
Demos & Use Cases
-----------------
`Predict customer lifetime value`_
==================================
- A common use case for machine learning is to predict customer lifetime value. This article explains why this prediction problem matters and uses Featuretools to address it.

.. _`Predict customer lifetime value`: https://towardsdatascience.com/automating-interpretable-feature-engineering-for-predicting-clv-87ece7da9b36

`Predict NHL playoff matches`_
==============================
- Many users of `Kaggle`_ are eager to use Featuretools to improve their model performance. In this blog post, a Kaggle user takes a dataset of plays from National Hockey League games and creates a model to predict if a game is a playoff match.

.. _`Predict NHL playoff matches`: https://towardsdatascience.com/automated-feature-engineering-for-predictive-modeling-d8c9fa4e478b
.. _`Kaggle`: https://www.kaggle.com/

`Predict poverty of households in Costa Rica`_
==============================================
- Social programs have a difficult time determining the right people to give aid. Using a dataset of Costa Rican household characteristics, this Kaggle kernel predicts the poverty of households.

.. _`Predict poverty of households in Costa Rica`: https://www.kaggle.com/willkoehrsen/featuretools-for-good

`Predicting Functional Threshold Power (FTP)`_
==============================================
- This notebook and accompanying report evaluate the use of machine learning for predicting a cyclist’s FTP using data collected from previous training sessions. Featuretools is used to generate a set of independent variables that capture changes in performance over time.

.. _`Predicting Functional Threshold Power (FTP)`: https://github.com/jrkinley/ftp_proba

.. note::

    For more demos written by `Feature Labs <https://www.featurelabs.com>`_, see `featuretools.com/demos <https://www.featuretools.com/demos/>`_

---------
Tutorials
---------
`Automated Feature Engineering in Python`_
==========================================
- This article provides a walk-through of how to use a retail dataset with DFS.

.. _`Automated Feature Engineering in Python`: https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219

`A Hands-On Guide to Automated Feature Engineering`_
====================================================
- An **in-depth** tutorial that works through using Featuretools to predict future product sales at "BigMart".

.. _`A Hands-On Guide to Automated Feature Engineering`: https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/

`Introduction to Automated Feature Engineering Using DFS`_
==========================================================
- This article demonstrates how Featuretools helps automate the manual process of feature engineering on a dataset of home loans.

.. _`Introduction to Automated Feature Engineering Using DFS`: https://heartbeat.fritz.ai/introduction-to-automated-feature-engineering-using-deep-feature-synthesis-dfs-3feb69a7c00b

`Automated Feature Engineering Workshop`_
=========================================
- An automated feature engineering workshop using Featuretools hosted at the 2017 Data Summer Conference.

.. _`Automated Feature Engineering Workshop`: https://github.com/fred-navruzov/featuretools-workshop

`Tutorial in Japanese`_
=======================
- A Featuretools tutorial that demonstrates integrating with the feature selection library `Boruta`_ and the hyperparameter tuning library `Optuna`_.

.. _`Tutorial in Japanese`: https://dev.classmethod.jp/machine-learning/yoshim-featuretools-boruta-optuna/
.. _`Optuna`: https://github.com/pfnet/optuna
.. _`Boruta`: https://github.com/scikit-learn-contrib/boruta_py

`Building a Churn Prediction Model using Featuretools`_
=======================================================
- A video tutorial that shows how to build a churn prediction model using Featuretools along with `Spark`_, `XGBoost`_, and `Google Cloud Platform`_.

.. _`Building a Churn Prediction Model using Featuretools`: https://youtu.be/ZwwneZ6iU3Y
.. _`Spark`: https://spark.apache.org/
.. _`XGBoost`: https://github.com/dmlc/xgboost
.. _`Google Cloud Platform`: https://cloud.google.com/

`Automated Feature Engineering Workshop in Russian`_
====================================================
- A video tutorial that shows how to predict if an applicant is capable of repaying a loan using Featuretools.

.. _`Automated Feature Engineering Workshop in Russian`: https://youtu.be/R0-mnamKxqY


================================================
FILE: docs/source/resources/frequently_asked_questions.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Frequently Asked Questions\n",
    "\n",
    "This page answers some commonly asked questions that appear on GitHub and Stack Overflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import woodwork as ww\n",
    "\n",
    "import featuretools as ft"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## EntitySet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I get a list of column names and types in an `EntitySet`?\n",
    "\n",
    "After you create your `EntitySet`, you may wish to view the column names. An `EntitySet` contains multiple DataFrames, one for each table in the `EntitySet`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to view the underlying DataFrame, you can do the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"transactions\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to view the columns and types for the \"transactions\" DataFrame, you can do the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"transactions\"].ww"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is the difference between `copy_columns` and `additional_columns`?\n",
    "The function `normalize_dataframe` creates a new DataFrame and a relationship from unique values of an existing DataFrame. It takes 2 similar arguments:\n",
    "\n",
    "- `additional_columns` removes columns from the base DataFrame and moves them to the new DataFrame. \n",
    "- `copy_columns` keeps the given columns in the base DataFrame, but also copies them to the new DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = ft.demo.load_mock_customer()\n",
    "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n",
    "products_df = data[\"products\"]\n",
    "\n",
    "es = ft.EntitySet(id=\"customer_data\")\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"transactions\",\n",
    "    dataframe=transactions_df,\n",
    "    index=\"transaction_id\",\n",
    "    time_index=\"transaction_time\",\n",
    ")\n",
    "\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"products\", dataframe=products_df, index=\"product_id\"\n",
    ")\n",
    "\n",
    "es = es.add_relationship(\"products\", \"product_id\", \"transactions\", \"product_id\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we normalize to create a new DataFrame, let's look at the base DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"transactions\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the columns `session_id`, `session_start`, `join_date`, `device`, `customer_id`, and `zip_code`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = es.normalize_dataframe(\n",
    "    base_dataframe_name=\"transactions\",\n",
    "    new_dataframe_name=\"sessions\",\n",
    "    index=\"session_id\",\n",
    "    make_time_index=\"session_start\",\n",
    "    additional_columns=[\"join_date\"],\n",
    "    copy_columns=[\"device\", \"customer_id\", \"zip_code\", \"session_start\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we normalized the columns to create a new DataFrame.\n",
    "\n",
    "- For `additional_columns`, the column `['join_date']` is removed from the `transactions` DataFrame and moved to the new `sessions` DataFrame.\n",
    "\n",
    "- For `copy_columns`, the following columns `['device', 'customer_id', 'zip_code','session_start']` will be copied from the `transactions` DataFrame to the new `sessions` DataFrame. \n",
    "\n",
    "Let's see this in the actual `EntitySet`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"transactions\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice above how `['device', 'customer_id', 'zip_code', 'session_start']` are still in the `transactions` DataFrame, while `['join_date']` is not. All of these columns are now present in the `sessions` DataFrame, as seen below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"sessions\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Why did my columns get new semantic tags?\n",
    "\n",
    "During the creation of your `EntitySet`, you might be wondering why the semantic tags in your columns change."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = ft.demo.load_mock_customer()\n",
    "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n",
    "products_df = data[\"products\"]\n",
    "\n",
    "es = ft.EntitySet(id=\"customer_data\")\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"transactions\",\n",
    "    dataframe=transactions_df,\n",
    "    index=\"transaction_id\",\n",
    "    time_index=\"transaction_time\",\n",
    ")\n",
    "es.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a column contains semantic tags, they will appear on the right side of a semicolon in the plot above. Notice how `session_id` and `session_start` do not currently have any semantic tags associated with them.\n",
    "\n",
    "Now, let's normalize the transactions DataFrame to create a new DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = es.normalize_dataframe(\n",
    "    base_dataframe_name=\"transactions\",\n",
    "    new_dataframe_name=\"sessions\",\n",
    "    index=\"session_id\",\n",
    "    make_time_index=\"session_start\",\n",
    "    additional_columns=[\"session_start\"],\n",
    ")\n",
    "es.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `session_id` column now has the semantic tag `foreign_key` in the `transactions` DataFrame, and `index` in the new DataFrame, `sessions`. This is because normalizing the DataFrame creates a new relationship between `transactions` and `sessions`. There is a one-to-many relationship between the parent DataFrame, `sessions`, and the child DataFrame, `transactions`.\n",
    "\n",
    "Therefore, `session_id` has the semantic tag `foreign_key` in `transactions` because it represents an `index` in another DataFrame. There would be a similar effect if we added another DataFrame using `add_dataframe` and `add_relationship`. \n",
    "\n",
    "In addition, when we created the new DataFrame, we set `session_start` as the `time_index`. This added the semantic tag `time_index` to the `session_start` column in the new `sessions` DataFrame because it now represents a `time_index`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I update a column's description or metadata?\n",
    "\n",
    "You can directly update the description or metadata attributes of the column schema. However, you must specifically use the column schema returned by `DataFrame.ww.columns['col_name']`, **not** `DataFrame.ww['col_name'].ww.schema`. The column schema from `DataFrame.ww.columns['col_name']` is still associated with the EntitySet and propagates any attribute updates, whereas the other does not. As an example, this is how you can update a column's description or metadata:\n",
    "\n",
    "```python\n",
    "column_schema = df.ww.columns['col_name']\n",
    "column_schema.description = 'my description'\n",
    "column_schema.metadata.update(key='value')\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I combine two or more interesting values?\n",
    "\n",
    "You might want to create features that are conditioned on multiple values before they are calculated. This would require the use of `interesting_values`. However, since we are trying to create the feature with multiple conditions, we will need to modify the DataFrame before we create the `EntitySet`.\n",
    "\n",
    "Let's look at how you might accomplish this.\n",
    "\n",
    "First, let's create our DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = ft.demo.load_mock_customer()\n",
    "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n",
    "products_df = data[\"products\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transactions_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "products_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's modify our `transactions` DataFrame to create the additional column that represents multiple conditions for our feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transactions_df[\"product_id_device\"] = (\n",
    "    transactions_df[\"product_id\"].astype(str) + \" and \" + transactions_df[\"device\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we created a new column called `product_id_device`, which just combines the `product_id` column, and the `device` column.\n",
    "\n",
    "Now let's create our `EntitySet`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.EntitySet(id=\"customer_data\")\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"transactions\",\n",
    "    dataframe=transactions_df,\n",
    "    index=\"transaction_id\",\n",
    "    time_index=\"transaction_time\",\n",
    "    logical_types={\n",
    "        \"product_id\": ww.logical_types.Categorical,\n",
    "        \"product_id_device\": ww.logical_types.Categorical,\n",
    "        \"zip_code\": ww.logical_types.PostalCode,\n",
    "    },\n",
    ")\n",
    "\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"products\", dataframe=products_df, index=\"product_id\"\n",
    ")\n",
    "\n",
    "es = es.normalize_dataframe(\n",
    "    base_dataframe_name=\"transactions\",\n",
    "    new_dataframe_name=\"sessions\",\n",
    "    index=\"session_id\",\n",
    "    additional_columns=[\"device\", \"product_id_device\", \"customer_id\"],\n",
    ")\n",
    "\n",
    "es = es.normalize_dataframe(\n",
    "    base_dataframe_name=\"sessions\", new_dataframe_name=\"customers\", index=\"customer_id\"\n",
    ")\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we are ready to add our interesting values. \n",
    "\n",
    "First, let's view our options for what the interesting values could be."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interesting_values = transactions_df[\"product_id_device\"].unique().tolist()\n",
    "interesting_values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you wanted to, you could pick a subset of these, and the `where` features created would only use those conditions. In our example, we will use all the possible interesting values.\n",
    "\n",
    "Here, we set all of these values as our interesting values for this specific DataFrame and column. If we wanted to, we could make interesting values in the same way for more than one column, but we will just stick with this one for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "values = {\"product_id_device\": interesting_values}\n",
    "es.add_interesting_values(dataframe_name=\"sessions\", values=values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can run DFS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"count\"],\n",
    "    where_primitives=[\"count\"],\n",
    "    trans_primitives=[],\n",
    ")\n",
    "feature_matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To better understand the `where` clause features, let's examine one of those features.\n",
    "The feature `COUNT(sessions WHERE product_id_device = 5 and tablet)` tells us in how many sessions the customer purchased `product_id` 5 while on a tablet. Notice how the feature depends on multiple conditions **(product_id = 5 & device = tablet)**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix[[\"COUNT(sessions WHERE product_id_device = 5 and tablet)\"]]"
   ]
  },
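  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The combined column works because a single equality test on the concatenated string selects the same rows as testing `product_id` and `device` separately. Below is a minimal pure-Python sketch of that equivalence. The `rows` list and the `combined_key` helper are made up for illustration and are not part of Featuretools:\n",
    "\n",
    "```python\n",
    "# Made-up rows standing in for the merged transactions data (assumption).\n",
    "rows = [\n",
    "    {'product_id': 5, 'device': 'tablet'},\n",
    "    {'product_id': 5, 'device': 'mobile'},\n",
    "    {'product_id': 2, 'device': 'tablet'},\n",
    "    {'product_id': 5, 'device': 'tablet'},\n",
    "]\n",
    "\n",
    "\n",
    "def combined_key(row):\n",
    "    # Hypothetical helper: builds the value the same way as product_id_device.\n",
    "    return str(row['product_id']) + ' and ' + row['device']\n",
    "\n",
    "\n",
    "# One equality test on the combined key ...\n",
    "by_key = sum(combined_key(r) == '5 and tablet' for r in rows)\n",
    "# ... matches the same rows as testing both conditions separately.\n",
    "by_both = sum(r['product_id'] == 5 and r['device'] == 'tablet' for r in rows)\n",
    "```\n",
    "\n",
    "Both counts agree for these rows, which is why one interesting value on the combined column can stand in for two conditions."
   ]
  },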
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DFS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Why is DFS not creating aggregation features?\n",
    "You may have created your `EntitySet`, and then applied DFS to create features. However, you may be puzzled as to why no aggregation features were created. \n",
    "\n",
    "- **This is most likely because you have a single DataFrame in your EntitySet, and DFS is not capable of creating aggregation features with fewer than 2 DataFrames. Featuretools looks for a relationship, and aggregates based on that relationship.**\n",
    "\n",
    "Let's look at a simple example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = ft.demo.load_mock_customer()\n",
    "transactions_df = data[\"transactions\"].merge(data[\"sessions\"]).merge(data[\"customers\"])\n",
    "\n",
    "es = ft.EntitySet(id=\"customer_data\")\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"transactions\", dataframe=transactions_df, index=\"transaction_id\"\n",
    ")\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how we only have 1 DataFrame in our `EntitySet`. If we try to create aggregation features on this `EntitySet`, it will not be possible because DFS needs 2 DataFrames to generate aggregation features. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es, target_dataframe_name=\"transactions\"\n",
    ")\n",
    "feature_defs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "None of the above features are aggregation features. To fix this issue, you can add another DataFrame to your `EntitySet`.\n",
    "\n",
    "**Solution #1 - You can add a new DataFrame if you have additional data.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "products_df = data[\"products\"]\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"products\", dataframe=products_df, index=\"product_id\"\n",
    ")\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how we now have an additional DataFrame in our `EntitySet`, called `products`.\n",
    "\n",
    "**Solution #2 - You can normalize an existing DataFrame.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = es.normalize_dataframe(\n",
    "    base_dataframe_name=\"transactions\",\n",
    "    new_dataframe_name=\"sessions\",\n",
    "    index=\"session_id\",\n",
    "    make_time_index=\"session_start\",\n",
    "    additional_columns=[\"device\", \"customer_id\", \"zip_code\", \"join_date\"],\n",
    "    copy_columns=[\"session_start\"],\n",
    ")\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how we now have an additional DataFrame in our `EntitySet`, called `sessions`. Here, the normalization created a relationship between `transactions` and `sessions`. However, we could have specified a relationship between `transactions` and `products` if we had only used Solution \\#1.\n",
    "\n",
    "Now, we can generate aggregation features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es, target_dataframe_name=\"transactions\"\n",
    ")\n",
    "feature_defs[:-10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A few of the aggregation features are:\n",
    "\n",
    "- `<Feature: sessions.MAX(transactions.amount)>`\n",
    "- `<Feature: sessions.SKEW(transactions.amount)>`\n",
    "- `<Feature: sessions.MIN(transactions.amount)>`\n",
    "- `<Feature: sessions.MEAN(transactions.amount)>`\n",
    "- `<Feature: sessions.COUNT(transactions)>`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I speed up the runtime of DFS?\n",
    "\n",
    "One issue you may encounter while running `ft.dfs` is slow performance. While the Featuretools default settings work well in most cases, you may want to speed up performance when you are calculating a large number of features.\n",
    "\n",
    "One quick way to speed up performance is by adjusting the `n_jobs` setting of `ft.dfs` or `ft.calculate_feature_matrix`.\n",
    "\n",
    "```python\n",
    "# setting n_jobs to -1 will use all cores\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(entityset=es,\n",
    "                                      target_dataframe_name=\"customers\",\n",
    "                                      n_jobs=-1)\n",
    "\n",
    "# calculate_feature_matrix returns only the feature matrix\n",
    "feature_matrix = ft.calculate_feature_matrix(entityset=es,\n",
    "                                             features=feature_defs,\n",
    "                                             n_jobs=-1)\n",
    "```\n",
    "\n",
    "\n",
    "**For more ways to speed up performance, please visit:**\n",
    "\n",
    "- [Improving Computational Performance](../guides/performance.ipynb#improving-computational-performance)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I include only certain features when running DFS?\n",
    "\n",
    "When using DFS to generate features, you may wish to include only certain features. There are multiple ways that you can do this:\n",
    "\n",
    "- Use `ignore_columns` to specify columns in a DataFrame that should not be used to create features. It is a dictionary mapping dataframe names to a list of column names to ignore.\n",
    "\n",
    "- Use `drop_contains` to drop features that contain any of the strings listed in this parameter.\n",
    "\n",
    "- Use `drop_exact` to drop features that exactly match any of the strings listed in this parameter.\n",
    "\n",
    "Here is an example of using all three parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    ignore_columns={\n",
    "        \"transactions\": [\"amount\"],\n",
    "        \"customers\": [\"age\", \"gender\", \"birthday\"],\n",
    "    },  # ignore these columns\n",
    "    drop_contains=[\"customers.SUM(\"],  # drop features that contain these strings\n",
    "    drop_exact=[\"STD(transactions.quantity)\"],  # drop features that exactly match\n",
    ")"
   ]
  },
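  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between `drop_contains` and `drop_exact` concrete, here is a minimal pure-Python sketch of the matching rule as described above. It is not the Featuretools implementation; the feature names and the `keep_feature` helper are hypothetical:\n",
    "\n",
    "```python\n",
    "# Hypothetical feature names; real names depend on your EntitySet.\n",
    "feature_names = [\n",
    "    'COUNT(sessions)',\n",
    "    'SUM(transactions.amount)',\n",
    "    'customers.SUM(transactions.amount)',\n",
    "    'STD(transactions.amount)',\n",
    "    'customers.STD(transactions.amount)',\n",
    "]\n",
    "\n",
    "\n",
    "def keep_feature(name, drop_contains=(), drop_exact=()):\n",
    "    # drop_contains: drop when any listed string is a substring of the name.\n",
    "    if any(s in name for s in drop_contains):\n",
    "        return False\n",
    "    # drop_exact: drop only when the whole name equals a listed string.\n",
    "    return name not in drop_exact\n",
    "\n",
    "\n",
    "kept = [\n",
    "    name\n",
    "    for name in feature_names\n",
    "    if keep_feature(\n",
    "        name,\n",
    "        drop_contains=['customers.SUM('],\n",
    "        drop_exact=['STD(transactions.amount)'],\n",
    "    )\n",
    "]\n",
    "```\n",
    "\n",
    "In this sketch, `customers.STD(transactions.amount)` is kept because `drop_exact` compares whole names, while `customers.SUM(transactions.amount)` is dropped because a substring match is enough for `drop_contains`."
   ]
  },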
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I specify primitives on a per column or per DataFrame basis?\n",
    "\n",
    "When using DFS to generate features, you may wish to use only certain features or DataFrames for specific primitives. This can be done through the `primitive_options` parameter. The `primitive_options` parameter is a dictionary that maps a primitive or a tuple of primitives to a dictionary containing options for the primitive(s). A primitive or tuple of primitives can also be mapped to a list of option dictionaries if the primitive(s) take multiple inputs.\n",
    "The primitive keys can be the string names of the primitive, the primitive class, or specific instances of the primitive. Each dictionary supplies options for its respective input column. There are multiple ways to control how primitives get applied through these options:\n",
    "\n",
    "- Use `ignore_dataframes` to specify DataFrames that should not be used to create features for that primitive. It is a list of DataFrame names to ignore.\n",
    "\n",
    "- Use `include_dataframes` to specify the only DataFrames to be included to create features for that primitive. It is a list of DataFrame names to include.\n",
    "\n",
    "- Use `ignore_columns` to specify columns in a DataFrame that should not be used to create features for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to ignore.\n",
    "\n",
    "- Use `include_columns` to specify the only columns in a DataFrame that should be used to create features for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to include.\n",
    "\n",
    "You can also use `primitive_options` to specify which DataFrames or columns you wish to use as groupbys for groupby transformation primitives:\n",
    "\n",
    "- Use `ignore_groupby_dataframes` to specify DataFrames that should not be used to get groupbys for that primitive. It is a list of DataFrame names to ignore.\n",
    "\n",
    "- Use `include_groupby_dataframes` to specify the only DataFrames that should be used to get groupbys for that primitive. It is a list of DataFrame names to include.\n",
    "\n",
    "- Use `ignore_groupby_columns` to specify columns in a DataFrame that should not be used as groupbys for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to ignore.\n",
    "\n",
    "- Use `include_groupby_columns` to specify the only columns in a DataFrame that should be used as groupbys for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to include.\n",
    "\n",
    "Here is an example of using some of these options:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    primitive_options={\n",
    "        \"mode\": {\n",
    "            \"ignore_dataframes\": [\"sessions\"],\n",
    "            \"ignore_columns\": {\"products\": [\"brand\"], \"transactions\": [\"product_id\"]},\n",
    "        },\n",
    "        # For mode, ignore the \"sessions\" DataFrame and only include \"brands\" in the\n",
    "        # \"products\" dataframe and \"product_id\" in the \"transactions\" DataFrame\n",
    "        (\"count\", \"mean\"): {\"include_dataframes\": [\"sessions\", \"transactions\"]},\n",
    "        # For count and mean, only include the dataframes \"sessions\" and \"transactions\"\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that if options are given for a specific instance of a primitive and for the primitive generally (either by string name or class), the instances with their own options will not use the generic options. For example, in this case:\n",
    "```\n",
    "special_mean = Mean()\n",
    "options = {\n",
    "    special_mean: {'include_dataframes': ['customers']},\n",
    "    'mean': {'include_dataframes': ['sessions']}\n",
    "```\n",
    "the primitive `special_mean` will not use the DataFrame `sessions` because it's options have it only include `customers`. Every other instance of the `Mean` primitive will use the `'mean'` options.  \n",
    "\n",
    "**For more examples of specifying options for DFS, please visit:**\n",
    "\n",
    "- [Specifying Primitive Options](../guides/specifying_primitive_options.rst)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### If I didn't specify the **cutoff_time**, what date will be used for the feature calculations?\n",
    "\n",
    "The cutoff time will be set to the current time using `cutoff_time = datetime.now()`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I select a certain amount of past data when calculating features?\n",
    "\n",
    "You may encounter a situation when you wish to make prediction using only a certain amount of historical data. You can accomplish this using the `training_window` parameter in `ft.dfs`. When you use the `training_window`, Featuretools will use the historical data between the `cutoff_time` and `cutoff_time - training_window`.\n",
    "\n",
    "In order to make the calculation, Featuretools will check the time in the `time_index` column of the `target_dataframe`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "es[\"customers\"].ww.time_index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our target_dataframe has a `time_index`, which is needed for the `training_window` calculation. Here, we are creating a cutoff time DataFrame so that we can have a unique training window for each customer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cutoff_times = pd.DataFrame()\n",
    "cutoff_times[\"customer_id\"] = [1, 2, 3, 1]\n",
    "cutoff_times[\"time\"] = pd.to_datetime(\n",
    "    [\"2014-1-1 04:00\", \"2014-1-1 05:00\", \"2014-1-1 06:00\", \"2014-1-1 08:00\"]\n",
    ")\n",
    "cutoff_times[\"label\"] = [True, True, False, True]\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    cutoff_time=cutoff_times,\n",
    "    cutoff_time_in_index=True,\n",
    "    training_window=\"1 hour\",\n",
    ")\n",
    "feature_matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we ran DFS with `training_window` argument of `1 hour` to create features that only used customer data collected in the last hour (from the cutoff time we provided)."
   ]
  },
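  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the window arithmetic concrete, here is a minimal pandas-only sketch of which timestamps fall inside a one-hour training window. It is an illustration, not Featuretools' internal implementation: the `event_times` values are made up, and treating the window start as exclusive and the cutoff time as inclusive is an assumption.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "cutoff_time = pd.Timestamp(\"2014-01-01 04:00\")\n",
    "training_window = pd.Timedelta(\"1 hour\")\n",
    "window_start = cutoff_time - training_window\n",
    "\n",
    "# Hypothetical event timestamps, for illustration only\n",
    "event_times = pd.Series(\n",
    "    pd.to_datetime(\n",
    "        [\"2014-01-01 02:30\", \"2014-01-01 03:15\", \"2014-01-01 03:59\", \"2014-01-01 04:30\"]\n",
    "    )\n",
    ")\n",
    "\n",
    "# Assumed boundaries: after the window start, up to and including the cutoff time\n",
    "in_window = (event_times > window_start) & (event_times <= cutoff_time)\n",
    "print(event_times[in_window].tolist())\n",
    "```"
   ]
  },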
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Can I run DFS on a single table? \n",
    "\n",
    "Although possible, running DFS on a single table doesn't make full use of DFS's capabilities. For one, DFS will not be able to use any aggregation primitives, which require at least two tables. You will only be able to use transform primitives. This limits the complexity of the features that DFS can generate through feature stacking. Additionally, in certain situations, running single table DFS on data with time columns could risk label leakage. With data split in multiple tables, featuretools can filter data based on the cutoff time instead of assuming data was flattened appropriately, but it can not do this with only a single table. \n",
    "\n",
    "If you only have a single table of data, DFS can certainly still be of use. There are two main ways to pass in a single table to DFS. \n",
    "\n",
    "The first is to simply create an EntitySet with one table. \n",
    "\n",
    "For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transactions_df = ft.demo.load_mock_customer(return_single_table=True)\n",
    "\n",
    "es = ft.EntitySet(id=\"customer_data\")\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"transactions\",\n",
    "    dataframe=transactions_df,\n",
    "    index=\"transaction_id\",\n",
    "    time_index=\"transaction_time\",\n",
    ")\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"transactions\",\n",
    "    trans_primitives=[\n",
    "        \"time_since\",\n",
    "        \"day\",\n",
    "        \"is_weekend\",\n",
    "        \"cum_min\",\n",
    "        \"minute\",\n",
    "        \"weekday\",\n",
    "        \"percentile\",\n",
    "        \"year\",\n",
    "        \"week\",\n",
    "        \"cum_mean\",\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The second way is to insert the dataframe into a dictionary mapping its name to a tuple containing specific dataframe information. We then pass in that dictionary to the `dataframes` argument in DFS.\n",
    "\n",
    "In this scenario, for the value in our dictionary, we pass in a tuple containing the dataframe, its index column, and its time index. More information about the possible parameters can be found in the [DFS documentation](https://featuretools.alteryx.com/en/stable/generated/featuretools.dfs.html#featuretools.dfs).\n",
    "\n",
    "For example: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transactions_df = ft.demo.load_mock_customer(return_single_table=True)\n",
    "\n",
    "dataframes = {\"transactions\": (transactions_df, \"transaction_id\", \"transaction_time\")}\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    dataframes=dataframes,\n",
    "    target_dataframe_name=\"transactions\",\n",
    "    trans_primitives=[\n",
    "        \"time_since\",\n",
    "        \"day\",\n",
    "        \"is_weekend\",\n",
    "        \"cum_min\",\n",
    "        \"minute\",\n",
    "        \"weekday\",\n",
    "        \"percentile\",\n",
    "        \"year\",\n",
    "        \"week\",\n",
    "        \"cum_mean\",\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we examine the output, let's look at our original single table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transactions_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can look at the transformations that Featuretools was able to apply to this single DataFrame to create feature matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I prevent label leakage with DFS?\n",
    "\n",
    "One concern you might have with using DFS is about label leakage. You want to make sure that labels in your data aren't used incorrectly to create features and the feature matrix.\n",
    "\n",
    "**Featuretools is particularly focused on helping users avoid label leakage.**\n",
    "\n",
    "There are two ways to prevent label leakage depending on if your data has timestamps or not.\n",
    "\n",
    "#### 1. Data without timestamps\n",
    "In the case where you do not have timestamps, you can create one `EntitySet` using only the training data and then run `ft.dfs`. This will create a feature matrix using only the training data, but also return a list of feature definitions. Next, you can create an `EntitySet` using the test data and recalculate the same features by calling `ft.calculate_feature_matrix` with the list of feature definitions from before. \n",
    "\n",
    "Here is what that flow would look like:\n",
    "\n",
    "First, let's create our training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data = pd.DataFrame(\n",
    "    {\n",
    "        \"customer_id\": [1, 2, 3, 4, 5],\n",
    "        \"age\": [40, 50, 10, 20, 30],\n",
    "        \"gender\": [\"m\", \"f\", \"m\", \"f\", \"f\"],\n",
    "        \"signup_date\": pd.date_range(\"2014-01-01 01:41:50\", periods=5, freq=\"25min\"),\n",
    "        \"labels\": [True, False, True, False, True],\n",
    "    }\n",
    ")\n",
    "train_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can create an entityset for our training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es_train_data = ft.EntitySet(id=\"customer_train_data\")\n",
    "es_train_data = es_train_data.add_dataframe(\n",
    "    dataframe_name=\"customers\", dataframe=train_data, index=\"customer_id\"\n",
    ")\n",
    "es_train_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we are ready to create our features, and feature matrix for the training data.  We don't want Featuretools to use the labels column to build new features, so we will use the ``ignore_columns`` option to exclude it.  This would also remove the labels column from the feature matrix, so we will tell DFS to include it as a seed feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels_feature = ft.Feature(es_train_data[\"customers\"].ww[\"labels\"])\n",
    "feature_matrix_train, feature_defs = ft.dfs(\n",
    "    entityset=es_train_data,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    ignore_columns={\"customers\": [\"labels\"]},\n",
    "    seed_features=[labels_feature],\n",
    ")\n",
    "feature_matrix_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will also encode our feature matrix to make machine learning compatible features. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix_train_enc, features_enc = ft.encode_features(\n",
    "    feature_matrix_train, feature_defs\n",
    ")\n",
    "feature_matrix_train_enc.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how the whole feature matrix only includes numeric and boolean values now."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use the feature definitions to calculate our feature matrix for the test data, and avoid label leakage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_train = pd.DataFrame(\n",
    "    {\n",
    "        \"customer_id\": [6, 7, 8, 9, 10],\n",
    "        \"age\": [20, 25, 55, 22, 35],\n",
    "        \"gender\": [\"f\", \"m\", \"m\", \"m\", \"m\"],\n",
    "        \"signup_date\": pd.date_range(\"2014-01-01 01:41:50\", periods=5, freq=\"25min\"),\n",
    "        \"labels\": [True, False, False, True, True],\n",
    "    }\n",
    ")\n",
    "\n",
    "es_test_data = ft.EntitySet(id=\"customer_test_data\")\n",
    "es_test_data = es_test_data.add_dataframe(\n",
    "    dataframe_name=\"customers\",\n",
    "    dataframe=test_train,\n",
    "    index=\"customer_id\",\n",
    "    time_index=\"signup_date\",\n",
    ")\n",
    "\n",
    "# Use the feature definitions from earlier\n",
    "feature_matrix_enc_test = ft.calculate_feature_matrix(\n",
    "    features=features_enc, entityset=es_test_data\n",
    ")\n",
    "\n",
    "feature_matrix_enc_test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check out the [Modeling](frequently_asked_questions.ipynb#Modeling) section for an example of using the encoded matrix with sklearn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Data with timestamps\n",
    "\n",
    "If your data has timestamps, the best way to prevent label leakage is to use a list of **cutoff times**, which specify the last point in time data is allowed to be used for each row in the resulting feature matrix. To use **cutoff times**, you need to set a time index for each time sensitive DataFrame in your entity set.\n",
    "\n",
    "> **Tip: Even if your data doesn’t have time stamps, you could add a column with dummy timestamps that can be used by Featuretools as time index.**\n",
    "\n",
    "When you call `ft.dfs`, you can provide a DataFrame of cutoff times like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cutoff_times = pd.DataFrame(\n",
    "    {\n",
    "        \"customer_id\": [1, 2, 3, 4, 5],\n",
    "        \"time\": pd.date_range(\"2014-01-01 01:41:50\", periods=5, freq=\"25min\"),\n",
    "    }\n",
    ")\n",
    "cutoff_times.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_test_data = pd.DataFrame(\n",
    "    {\n",
    "        \"customer_id\": [1, 2, 3, 4, 5],\n",
    "        \"age\": [20, 25, 55, 22, 35],\n",
    "        \"gender\": [\"f\", \"m\", \"m\", \"m\", \"m\"],\n",
    "        \"signup_date\": pd.date_range(\"2010-01-01 01:41:50\", periods=5, freq=\"25min\"),\n",
    "    }\n",
    ")\n",
    "\n",
    "es_train_test_data = ft.EntitySet(id=\"customer_train_test_data\")\n",
    "es_train_test_data = es_train_test_data.add_dataframe(\n",
    "    dataframe_name=\"customers\",\n",
    "    dataframe=train_test_data,\n",
    "    index=\"customer_id\",\n",
    "    time_index=\"signup_date\",\n",
    ")\n",
    "\n",
    "feature_matrix_train_test, features = ft.dfs(\n",
    "    entityset=es_train_test_data,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    cutoff_time=cutoff_times,\n",
    "    cutoff_time_in_index=True,\n",
    ")\n",
    "feature_matrix_train_test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we have created a feature matrix that uses cutoff times to avoid label leakage. We could also encode this feature matrix using `ft.encode_features`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is the difference between passing a primitive object versus a string to DFS?  \n",
    "\n",
    "There are 2 ways to pass primitives to DFS: the primitive object, or a string of the primitive name. \n",
    "\n",
    "We will use the Transform primitive called `TimeSincePrevious` to illustrate the differences.\n",
    "\n",
    "First, let's use the string of primitive name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.demo.load_mock_customer(return_entityset=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[],\n",
    "    trans_primitives=[\"time_since_previous\"],\n",
    ")\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's use the primitive object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from featuretools.primitives import TimeSincePrevious\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[],\n",
    "    trans_primitives=[TimeSincePrevious],\n",
    ")\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see above, the feature matrix is the same.\n",
    "\n",
    "However, if we need to modify controllable parameters in the primitive, we should use the primitive object. \n",
    "For instance, let's make TimeSincePrevious return units of hours (the default is in seconds)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from featuretools.primitives import TimeSincePrevious\n",
    "\n",
    "time_since_previous_in_hours = TimeSincePrevious(unit=\"hours\")\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[],\n",
    "    trans_primitives=[time_since_previous_in_hours],\n",
    ")\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How can I select features based on some attributes (a specific string, an explicit primitive type, a return type, a given depth)?\n",
    "\n",
    "You may wish to select a subset of your features based on some attributes. \n",
    "\n",
    "Let's say you wanted to select features that had the string `amount` in its name. You can check for this by using the `get_name` function on the feature definitions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "\n",
    "feature_defs = ft.dfs(\n",
    "    entityset=es, target_dataframe_name=\"customers\", features_only=True\n",
    ")\n",
    "\n",
    "features_with_amount = []\n",
    "for x in feature_defs:\n",
    "    if \"amount\" in x.get_name():\n",
    "        features_with_amount.append(x)\n",
    "features_with_amount[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might also want to only select features that are aggregation features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from featuretools import AggregationFeature\n",
    "\n",
    "features_only_aggregations = []\n",
    "for x in feature_defs:\n",
    "    if type(x) == AggregationFeature:\n",
    "        features_only_aggregations.append(x)\n",
    "features_only_aggregations[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also, you might only want to select features that are calculated at a certain depth. You can do this by using the `get_depth` function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features_only_depth_2 = []\n",
    "for x in feature_defs:\n",
    "    if x.get_depth() == 2:\n",
    "        features_only_depth_2.append(x)\n",
    "features_only_depth_2[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, you might only want features that return a certain type. You can do this by using the `column_schema` attribute. For more information on working with column schemas, take a look at [Transitioning from Variables to Woodwork](transition_to_ft_v1.0.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features_only_numeric = []\n",
    "for x in feature_defs:\n",
    "    if \"numeric\" in x.column_schema.semantic_tags:\n",
    "        features_only_numeric.append(x)\n",
    "features_only_numeric[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have your specific feature list, you can use `ft.calculate_feature_matrix` to generate a feature matrix for only those features.\n",
    "\n",
    "For our example, let's use the features with only the string `amount` in its name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix = ft.calculate_feature_matrix(\n",
    "    entityset=es, features=features_with_amount\n",
    ")  # change to your specific feature list\n",
    "feature_matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, notice how all the column names for our feature matrix contain the string `amount`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I create **where** features?\n",
    "\n",
    "Sometimes, you might want to create features that are conditioned on a second value before it is calculated. This extra filter is called a “where clause”. You can create these features using the using the `interesting_values` of a column.\n",
    "\n",
    "If you have categorical columns in your `EntitySet`, you can use `add_interesting_values`. This function will  find interesting values for your categorical columns, which can then be used to generate “where” clauses.\n",
    "\n",
    "First, let's create our `EntitySet`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can add the interesting values for the categorical column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es.add_interesting_values()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can run DFS with the `where_primitives` argument to define which primitives to apply with where clauses. In this case, let's use the primitive `count`. For this to work, the primitive `count` must be present in both `agg_primitives` and `where_primitives`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[\"count\"],\n",
    "    where_primitives=[\"count\"],\n",
    "    trans_primitives=[],\n",
    ")\n",
    "feature_matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have now created some useful features. One example of a useful feature is the `COUNT(sessions WHERE device = tablet)`. This feature tells us how many sessions a customer completed on a tablet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix[[\"COUNT(sessions WHERE device = tablet)\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Primitives"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is the difference between the primitive types (Transform, GroupBy Transform, & Aggregation)?\n",
    "\n",
    "You might curious to know the difference between the primitive groups.\n",
    "Let's review the differences between transform, groupby transform, and aggregation primitives.\n",
    "\n",
    "First, let's create a simple `EntitySet`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import featuretools as ft\n",
    "\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"id\": [1, 2, 3, 4, 5, 6],\n",
    "        \"time_index\": pd.date_range(\"1/1/2019\", periods=6, freq=\"D\"),\n",
    "        \"group\": [\"a\", \"a\", \"a\", \"a\", \"a\", \"a\"],\n",
    "        \"val\": [5, 1, 10, 20, 6, 23],\n",
    "    }\n",
    ")\n",
    "es = ft.EntitySet()\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"observations\", dataframe=df, index=\"id\", time_index=\"time_index\"\n",
    ")\n",
    "\n",
    "es = es.normalize_dataframe(\n",
    "    base_dataframe_name=\"observations\", new_dataframe_name=\"groups\", index=\"group\"\n",
    ")\n",
    "\n",
    "es.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After calling `normalize_dataframe`, the column \"group\" has the semantic tag \"foreign_key\" because it identifies another DataFrame. Alternatively, it could be set using the `semantic_tags` parameter when we first call `es.add_dataframe()`.\n",
    "\n",
    "#### Transform Primitive\n",
    "\n",
    "The cum_sum primitive calculates the running sum in list of numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from featuretools.primitives import CumSum\n",
    "\n",
    "cum_sum = CumSum()\n",
    "cum_sum([1, 2, 3, 4, 5]).tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we apply it using the `trans_primitives` argument it will calculate it over the entire observations DataFrame like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    target_dataframe_name=\"observations\",\n",
    "    entityset=es,\n",
    "    agg_primitives=[],\n",
    "    trans_primitives=[\"cum_sum\"],\n",
    "    groupby_trans_primitives=[],\n",
    ")\n",
    "\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Groupby Transform Primitive\n",
    "\n",
    "If we apply it using `groupby_trans_primitives`, then DFS will first group by any foreign key columns before applying the transform primitive. As a result, we get the cumulative sum by group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    target_dataframe_name=\"observations\",\n",
    "    entityset=es,\n",
    "    agg_primitives=[],\n",
    "    trans_primitives=[],\n",
    "    groupby_trans_primitives=[\"cum_sum\"],\n",
    ")\n",
    "\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Aggregation Primitive\n",
    "\n",
    "Finally, there is also the aggregation primitive \"sum\". If we use sum, it will calculate the sum for the group at the cutoff time for each row. Because we didn't specify a cutoff time it will use all the data for each group for each row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    target_dataframe_name=\"observations\",\n",
    "    entityset=es,\n",
    "    agg_primitives=[\"sum\"],\n",
    "    trans_primitives=[],\n",
    "    cutoff_time_in_index=True,\n",
    "    groupby_trans_primitives=[],\n",
    ")\n",
    "\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we set the cutoff time of each row to be the time index, then use sum as an aggregation primitive, the result is the same as cum_sum. (Though the order is different in the displayed dataframe)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cutoff_time = df[[\"id\", \"time_index\"]]\n",
    "cutoff_time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    target_dataframe_name=\"observations\",\n",
    "    entityset=es,\n",
    "    agg_primitives=[\"sum\"],\n",
    "    trans_primitives=[],\n",
    "    groupby_trans_primitives=[],\n",
    "    cutoff_time_in_index=True,\n",
    "    cutoff_time=cutoff_time,\n",
    ")\n",
    "\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I get a list of all Aggregation and Transform primitives?\n",
    "\n",
    "You can do `featuretools.list_primitives()` to get all the primitive in Featuretools. It will return a DataFrame with the names, type, and description of the primitives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_primitives = ft.list_primitives()\n",
    "df_primitives.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_primitives.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How do I change the units for a TimeSince primitive?\n",
    "There are a few primitives in Featuretools that make some time-based calculation. These include `TimeSince, TimeSincePrevious, TimeSinceLast, TimeSinceFirst`. \n",
    "\n",
    "You can change the units from the default seconds to any valid time unit, by doing the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from featuretools.primitives import (\n",
    "    TimeSince,\n",
    "    TimeSinceFirst,\n",
    "    TimeSinceLast,\n",
    "    TimeSincePrevious,\n",
    ")\n",
    "\n",
    "time_since = TimeSince(unit=\"minutes\")\n",
    "time_since_previous = TimeSincePrevious(unit=\"hours\")\n",
    "time_since_last = TimeSinceLast(unit=\"days\")\n",
    "time_since_first = TimeSinceFirst(unit=\"years\")\n",
    "\n",
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    agg_primitives=[time_since_last, time_since_first],\n",
    "    trans_primitives=[time_since, time_since_previous],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we changed the units to the following:\n",
    "- minutes for `TimeSince`\n",
    "- hours for `TimeSincePrevious`\n",
    "- days for `TimeSinceLast`\n",
    "- years for `TimeSinceFirst`.\n",
    "\n",
    "\n",
    "Now we can see that our feature matrix contains multiple features where the units for the TimeSince primitives are changed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are now features where time unit is different from the default of seconds, such as `TIME_SINCE_LAST(sessions.session_start, unit=days)`, and `TIME_SINCE_FIRST(sessions.session_start, unit=years)`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How does my train & test data work with Featuretools and sklearn's **train_test_split**?\n",
    "\n",
    "You might be wondering how to properly use your train & test data with Featuretools, and sklearn's **train_test_split**. There are a few things you must do to ensure accuracy with this workflow.\n",
    "\n",
    "Let's imagine we have a Dataframes for our train data, with the labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data = pd.DataFrame(\n",
    "    {\n",
    "        \"customer_id\": [1, 2, 3, 4, 5],\n",
    "        \"age\": [20, 25, 55, 22, 35],\n",
    "        \"gender\": [\"f\", \"m\", \"m\", \"m\", \"m\"],\n",
    "        \"signup_date\": pd.date_range(\"2010-01-01 01:41:50\", periods=5, freq=\"25min\"),\n",
    "        \"labels\": [False, True, True, False, False],\n",
    "    }\n",
    ")\n",
    "train_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create our `EntitySet` for the train data, and create our features. To prevent label leakage, we will use cutoff times (see [earlier question](#How-do-I-prevent-label-leakage-with-DFS?))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es_train_data = ft.EntitySet(id=\"customer_data\")\n",
    "es_train_data = es_train_data.add_dataframe(\n",
    "    dataframe_name=\"customers\", dataframe=train_data, index=\"customer_id\"\n",
    ")\n",
    "\n",
    "cutoff_times = pd.DataFrame(\n",
    "    {\n",
    "        \"customer_id\": [1, 2, 3, 4, 5],\n",
    "        \"time\": pd.date_range(\"2014-01-01 01:41:50\", periods=5, freq=\"25min\"),\n",
    "    }\n",
    ")\n",
    "\n",
    "feature_matrix_train, features = ft.dfs(\n",
    "    entityset=es_train_data,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    cutoff_time=cutoff_times,\n",
    "    cutoff_time_in_index=True,\n",
    ")\n",
    "feature_matrix_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will also encode our feature matrix to compatible for machine learning algorithms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix_train_enc, feature_enc = ft.encode_features(\n",
    "    feature_matrix_train, features\n",
    ")\n",
    "feature_matrix_train_enc.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X = feature_matrix_train_enc.drop([\"labels\"], axis=1)\n",
    "y = feature_matrix_train_enc[\"labels\"]\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can use the encoded feature matrix with sklearn's **train_test_split**. This will allow you to train your model, and tune your parameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How are categorical columns encoded when splitting training and testing data?\n",
    "\n",
    "You might be wondering what happens when categorical columns are encoded with your training and testing data. You might be curious to know what happens if the train data has a categorical column that is not present in the testing data. \n",
    "\n",
    "Let's explore a simple example to see what happens during the encoding process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data = pd.DataFrame(\n",
    "    {\n",
    "        \"customer_id\": [1, 2, 3, 4, 5],\n",
    "        \"product_purchased\": [\"coke zero\", \"car\", \"toothpaste\", \"coke zero\", \"car\"],\n",
    "    }\n",
    ")\n",
    "es_train = ft.EntitySet(id=\"customer_data\")\n",
    "es_train = es_train.add_dataframe(\n",
    "    dataframe_name=\"customers\",\n",
    "    dataframe=train_data,\n",
    "    index=\"customer_id\",\n",
    "    logical_types={\"product_purchased\": ww.logical_types.Categorical},\n",
    ")\n",
    "feature_matrix_train, features = ft.dfs(\n",
    "    entityset=es_train, target_dataframe_name=\"customers\"\n",
    ")\n",
    "feature_matrix_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use `ft.encode_features` to properly encode the `product_purchased` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix_train_encoded, features_encoded = ft.encode_features(\n",
    "    feature_matrix_train, features\n",
    ")\n",
    "feature_matrix_train_encoded.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets imagine we have some test data that has doesn't have one of the categorical values (**toothpaste**). Also, the test data has a value that wasn't present in the train data (**water**)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data = pd.DataFrame(\n",
    "    {\n",
    "        \"customer_id\": [6, 7, 8, 9, 10],\n",
    "        \"product_purchased\": [\"coke zero\", \"car\", \"coke zero\", \"coke zero\", \"water\"],\n",
    "    }\n",
    ")\n",
    "\n",
    "es_test = ft.EntitySet(id=\"customer_data\")\n",
    "es_test = es_test.add_dataframe(\n",
    "    dataframe_name=\"customers\", dataframe=test_data, index=\"customer_id\"\n",
    ")\n",
    "\n",
    "feature_matrix_test = ft.calculate_feature_matrix(\n",
    "    entityset=es_test, features=features_encoded\n",
    ")\n",
    "feature_matrix_test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As seen above, we were able to successfully handle the encoding, and deal with the following complications: \n",
    "- **toothpaste** was present in the training data but not present in the testing data \n",
    "- **water** was present in the test data but not present in the training data. "
   ]
  },
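  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reason this works can be illustrated without Featuretools. The following is a minimal sketch in plain Python, not the actual implementation of `ft.encode_features`: it assumes that encoding produces one boolean column per category seen in the training data plus a catch-all `is unknown` column. The helper name `encode_with_train_categories` is hypothetical.\n",
    "\n",
    "```python\n",
    "def encode_with_train_categories(values, train_categories):\n",
    "    # Hypothetical helper: encode values using only categories learned from training data.\n",
    "    rows = []\n",
    "    for value in values:\n",
    "        # One column per training category, even if it never occurs in `values`.\n",
    "        row = {f\"product_purchased = {cat}\": value == cat for cat in train_categories}\n",
    "        # Values never seen during training fall into a single catch-all column.\n",
    "        row[\"product_purchased is unknown\"] = value not in train_categories\n",
    "        rows.append(row)\n",
    "    return rows\n",
    "\n",
    "\n",
    "train_categories = [\"coke zero\", \"car\", \"toothpaste\"]\n",
    "test_values = [\"coke zero\", \"car\", \"coke zero\", \"coke zero\", \"water\"]\n",
    "encoded = encode_with_train_categories(test_values, train_categories)\n",
    "```\n",
    "\n",
    "Because the columns are fixed by the training categories, the **toothpaste** column still exists for the test rows (always `False` here), and **water** only sets the `is unknown` column."
   ]
  },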
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Errors & Warnings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Why am I getting this error 'Index is not unique on dataframe'?\n",
    "You may be trying to create your `EntitySet`, and run into this error. \n",
    "```python\n",
    "IndexError: Index column must be unique\n",
    "```\n",
    "**This is because each dataframe in your EntitySet needs a unique index.**\n",
    "\n",
    "Let's look at a simple example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "product_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 4], \"rating\": [3.5, 4.0, 4.5, 1.5, 5.0]})\n",
    "product_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how the `id` column has a duplicate index of `4`. If you try to add this dataframe to the EntitySet, you will run into the following error."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "es = ft.EntitySet(id=\"product_data\")\n",
    "es = es.add_dataframe(dataframe_name=\"products\",\n",
    "                      dataframe=product_df,\n",
    "                      index=\"id\")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "---------------------------------------------------------------------------\n",
    "IndexError                                Traceback (most recent call last)\n",
    "<ipython-input-78-854fbaf207f8> in <module>\n",
    "      1 es = ft.EntitySet(id=\"product_data\")\n",
    "----> 2 es = es.add_dataframe(dataframe_name=\"products\",\n",
    "      3                       dataframe=product_df,\n",
    "      4                       index=\"id\")\n",
    "\n",
    "~/Code/featuretools/featuretools/entityset/entityset.py in add_dataframe(self, dataframe, dataframe_name, index, logical_types, semantic_tags, make_index, time_index, secondary_time_index, already_sorted)\n",
    "    625             index_was_created, index, dataframe = _get_or_create_index(index, make_index, dataframe)\n",
    "    626 \n",
    "--> 627             dataframe.ww.init(name=dataframe_name,\n",
    "    628                               index=index,\n",
    "    629                               time_index=time_index,\n",
    "\n",
    "/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in init(self, index, time_index, logical_types, already_sorted, schema, validate, use_standard_tags, **kwargs)\n",
    "     94         \"\"\"\n",
    "     95         if validate:\n",
    "---> 96             _validate_accessor_params(self._dataframe, index, time_index, logical_types, schema, use_standard_tags)\n",
    "     97         if schema is not None:\n",
    "     98             self._schema = schema\n",
    "\n",
    "/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in _validate_accessor_params(dataframe, index, time_index, logical_types, schema, use_standard_tags)\n",
    "    877         # We ignore these parameters if a schema is passed\n",
    "    878         if index is not None:\n",
    "--> 879             _check_index(dataframe, index)\n",
    "    880         if logical_types:\n",
    "    881             _check_logical_types(dataframe.columns, logical_types)\n",
    "\n",
    "/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in _check_index(dataframe, index)\n",
    "    903         # User specifies an index that is in the dataframe but not unique\n",
    "--> 904         raise IndexError('Index column must be unique')\n",
    "    905 \n",
    "    906 \n",
    "\n",
    "IndexError: Index column must be unique\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To fix the above error, you can do one of the following solutions:\n",
    "\n",
    "**Solution #1 - You can create a unique index on your Dataframe.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "product_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 5], \"rating\": [3.5, 4.0, 4.5, 1.5, 5.0]})\n",
    "product_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how we now have a unique index column called `id`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = es.add_dataframe(dataframe_name=\"products\", dataframe=product_df, index=\"id\")\n",
    "es"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As seen above, we can now create our DataFrame for our `EntitySet` without an error by creating a unique index in our Dataframe.\n",
    "\n",
    "**Solution #2 - Set make_index to True in your call to add_dataframe to create a new index on that data**\n",
    "- `make_index` creates a unique index for each row by just looking at what number the row is, in relation to all the other rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "product_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 4], \"rating\": [3.5, 4.0, 4.5, 1.5, 5.0]})\n",
    "\n",
    "es = ft.EntitySet(id=\"product_data\")\n",
    "es = es.add_dataframe(\n",
    "    dataframe_name=\"products\", dataframe=product_df, index=\"product_id\", make_index=True\n",
    ")\n",
    "\n",
    "es[\"products\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As seen above, we created our dataframe for our `EntitySet` without an error using the `make_index` argument."
   ]
  },
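  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conceptually, `make_index` is similar to numbering the rows yourself. The sketch below uses plain Python rather than Featuretools; the `add_row_index` helper is hypothetical, and starting the numbering at 0 is an assumption made for illustration.\n",
    "\n",
    "```python\n",
    "def add_row_index(rows, index_name):\n",
    "    # Hypothetical helper: refuse to overwrite a column that already exists.\n",
    "    if any(index_name in row for row in rows):\n",
    "        raise ValueError(f\"column {index_name!r} already exists\")\n",
    "    # Use each row's position as its new, guaranteed-unique index value.\n",
    "    return [{index_name: position, **row} for position, row in enumerate(rows)]\n",
    "\n",
    "\n",
    "rows = [{\"id\": 4, \"rating\": 1.5}, {\"id\": 4, \"rating\": 5.0}]\n",
    "indexed = add_row_index(rows, \"product_id\")\n",
    "```\n",
    "\n",
    "The duplicated `id` values are left untouched; uniqueness comes only from the new `product_id` column."
   ]
  },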
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Why am I getting the following warning 'Using training_window but last_time_index is not set'?\n",
    "\n",
    "If you are using a training window, and you haven't set a `last_time_index` for your dataframe, you will get this warning.\n",
    "The training window attribute in Featuretools limits the amount of past data that can be used while calculating a particular feature vector.\n",
    "\n",
    "You can add the `last_time_index` to all dataframes automatically by calling `your_entityset.add_last_time_indexes()` after you create your `EntitySet`. This will remove the warning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.demo.load_mock_customer(return_entityset=True)\n",
    "es.add_last_time_indexes()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can run DFS without getting the warning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cutoff_times = pd.DataFrame()\n",
    "cutoff_times[\"customer_id\"] = [1, 2, 3, 1]\n",
    "cutoff_times[\"time\"] = pd.to_datetime(\n",
    "    [\"2014-1-1 04:00\", \"2014-1-1 05:00\", \"2014-1-1 06:00\", \"2014-1-1 08:00\"]\n",
    ")\n",
    "cutoff_times[\"label\"] = [True, True, False, True]\n",
    "\n",
    "feature_matrix, feature_defs = ft.dfs(\n",
    "    entityset=es,\n",
    "    target_dataframe_name=\"customers\",\n",
    "    cutoff_time=cutoff_times,\n",
    "    cutoff_time_in_index=True,\n",
    "    training_window=\"1 hour\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### last_time_index vs. time_index\n",
    "\n",
    "- The `time_index` is when the instance was first known.\n",
    "- The `last_time_index` is when the instance appears for the last time.\n",
    "- For example, a customer’s session has multiple transactions which can happen at different points in time. If we are trying to count the number of sessions a user has in a given time period, we often want to count all the sessions that had any transaction during the training window. To accomplish this, we need to not only know when a session starts (**time_index**), but also when it ends (**last_time_index**). The last time that an instance appears in the data is stored as the `last_time_index` of a dataframe. \n",
    "- Once the last_time_index has been set, Featuretools will check to see if the last_time_index is after the start of the training window. That, combined with the cutoff time, allows DFS to discover which data is relevant for a given training window."
   ]
  },
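  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The interplay of `time_index`, `last_time_index`, cutoff time, and training window can be sketched in plain Python. This illustrates the rule described above and is not Featuretools code: the `in_training_window` helper and the session timestamps are hypothetical, and the exact handling of boundary values in Featuretools may differ.\n",
    "\n",
    "```python\n",
    "from datetime import datetime, timedelta\n",
    "\n",
    "\n",
    "def in_training_window(time_index, last_time_index, cutoff_time, training_window):\n",
    "    # Hypothetical helper: the window covers the period that ends at the cutoff time.\n",
    "    window_start = cutoff_time - training_window\n",
    "    # Keep instances known by the cutoff time that were last seen after the window start.\n",
    "    return time_index <= cutoff_time and last_time_index > window_start\n",
    "\n",
    "\n",
    "cutoff = datetime(2014, 1, 1, 5, 0)\n",
    "window = timedelta(hours=1)\n",
    "\n",
    "# A session that started before the window but had a transaction inside it is kept.\n",
    "overlapping = in_training_window(\n",
    "    datetime(2014, 1, 1, 3, 50), datetime(2014, 1, 1, 4, 10), cutoff, window\n",
    ")\n",
    "# A session whose last transaction happened before the window started is dropped.\n",
    "finished = in_training_window(\n",
    "    datetime(2014, 1, 1, 2, 0), datetime(2014, 1, 1, 3, 0), cutoff, window\n",
    ")\n",
    "```\n",
    "\n",
    "Without a `last_time_index`, only the start time would be available, so the first session could not be recognized as overlapping the window."
   ]
  },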
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Why am I getting errors with Featuretools on [Google Colab](https://colab.research.google.com/)?\n",
    "\n",
    "[Google Colab](https://colab.research.google.com/), by default, has Featuretools `0.4.1` installed. You may run into issues following our newest guides, or latest documentation while using an older version of Featuretools. Therefore, we suggest you upgrade to the latest featuretools version by doing the following in your notebook in Google Colab:\n",
    "```shell\n",
    "!pip install -U featuretools\n",
    "```\n",
    "\n",
    "You may need to Restart the runtime by doing **Runtime** -> **Restart Runtime**.\n",
    "You can check latest Featuretools version by doing following:\n",
    "```python\n",
    "import featuretools as ft\n",
    "print(ft.__version__)\n",
    "```\n",
    "You should see a version greater than `0.4.1`"
   ]
  }
 ],
 "metadata": {
  "file_extension": ".py",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "mimetype": "text/x-python",
  "name": "python",
  "npconvert_exporter": "python",
  "pygments_lexer": "ipython3",
  "version": 3,
  "vscode": {
   "interpreter": {
    "hash": "3f6b062a214ec48d1657976024d6bc68979519d14a33afb6ad033fc2e4189514"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: docs/source/resources/help.rst
================================================
Help
====

Couldn't find what you were looking for?
The Featuretools community is happy to provide support to users of Featuretools.


Discussion
----------

Conversation happens in the following places:

1.  **General usage questions** are directed to `StackOverflow`_ with the #featuretools tag.
2.  **Bug reports** are managed on the `GitHub issue
    tracker`_.
3.  **Chat** and collaboration within the community occurs on `Slack`_. For general usage questions, please post on
    Stack Overflow where answers are more searchable by other users.

.. _`StackOverflow`: http://stackoverflow.com/questions/tagged/featuretools
.. _`Github issue tracker`: https://github.com/alteryx/featuretools/issues
.. _`Slack`: https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA


Asking for help
---------------
Users of all levels, including beginners, should feel free to ask questions and
report bugs when using Featuretools. You can get better answers if you follow a
few simple guidelines:

1.  **Use the right resource**: We suggest using GitHub or StackOverflow.
    Questions asked at these locations will be more searchable for other users.

    - Slack should be used for community discussion and collaboration.
    - For general questions on how something should work or tips, use StackOverflow.
    - Bugs should be reported on GitHub.

2.  **Ask in one place only**: Please post your question in one place
    (StackOverflow or GitHub).

3.  **Use examples**: Make `minimal, complete, verifiable examples
    <https://stackoverflow.com/help/mcve>`_. You will get
    much better answers if you provide code that people can use to reproduce
    your problem.


================================================
FILE: docs/source/resources/resources_index.rst
================================================
Resources
---------

Frequently asked questions and additional resources

.. toctree::
   :maxdepth: 1

   transition_to_ft_v1.0
   frequently_asked_questions
   help
   usage_tips/limitations
   usage_tips/glossary
   ecosystem


================================================
FILE: docs/source/resources/transition_to_ft_v1.0.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6004844f",
   "metadata": {},
   "source": [
    "# Transitioning to Featuretools Version 1.0\n",
    "\n",
    "Featuretools version 1.0 incorporates many significant changes that impact the way EntitySets are created, how primitives are defined, and in some cases the resulting feature matrix that is created. This document will provide an overview of the significant changes, helping existing Featuretools users transition to version 1.0.\n",
    "\n",
    "## Background and Introduction\n",
    "\n",
    "### Why make these changes?\n",
    "The lack of a unified type system across libraries makes sharing information between libraries more difficult. This problem led to the development of [Woodwork](https://woodwork.alteryx.com/en/stable/). Updating Featuretools to use Woodwork for managing column typing information enables easy sharing of feature matrix column types with other libraries without costly conversions between custom type systems. As an example, [EvalML](https://evalml.alteryx.com/en/stable/), which has also adopted Woodwork, can now use Woodwork typing information on a feature matrix directly to create machine learning models, without first inferring or redefining column types.\n",
    "\n",
    "Other benefits of using Woodwork for managing typing in Featuretools include:\n",
    "\n",
    "- Simplified code - custom type management code has been removed\n",
    "- Seamless integration of new types and improvements to type integration as Woodwork improves\n",
    "- Easy and flexible storage of additional information about columns. For example, we can now store whether a feature was engineered by Featuretools or present in the original data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a9bfede",
   "metadata": {},
   "source": [
    "### What has changed?\n",
    "- The legacy Featuretools custom typing system has been replaced with Woodwork for managing column types\n",
    "- Both the `Entity` and `Variable` classes have been removed from Featuretools\n",
    "- Several key Featuretools methods have been moved or updated\n",
    "\n",
    "#### Comparison between legacy typing system and Woodwork typing systems\n",
    "| Featuretools < 1.0 | Featuretools 1.0 | Description |\n",
    "| ---- | ---- | ---- |\n",
    "| Entity | Woodwork DataFrame | stores typing information for all columns |\n",
    "| Variable | ColumnSchema | stores typing information for a single column |\n",
    "| Variable subclass | LogicalType and semantic_tags | elements used to define a column type |\n",
    "\n",
    "#### Summary of significant method changes\n",
    "\n",
    "The table below outlines the most significant changes that have occurred. In Summary: In some cases, the method arguments have also changed, and those changes are outlined in more detail throughout this document.\n",
    "\n",
    "| Older Versions | Featuretools 1.0 |\n",
    "| ---- | ---- |\n",
    "| EntitySet.entity_from_dataframe | EntitySet.add_dataframe |\n",
    "| EntitySet.normalize_entity | EntitySet.normalize_dataframe |\n",
    "| EntitySet.update_data | EntitySet.replace_dataframe |\n",
    "| Entity.variable_types | es['dataframe_name'].ww |\n",
    "| es['entity_id']['variable_name'] | es['dataframe_name'].ww.columns['column_name'] |\n",
    "| Entity.convert_variable_type | es['dataframe_name'].ww.set_types |\n",
    "| Entity.add_interesting_values | es.add_interesting_values(dataframe_name='df_name', ...) |\n",
    "| Entity.set_secondary_time_index | es.set_secondary_time_index(dataframe_name='df_name', ...) |\n",
    "| Feature(es['entity_id']['variable_name']) | Feature(es['dataframe_name'].ww['column_name']) |\n",
    "| dfs(target_entity='entity_id', ...) | dfs(target_dataframe_name='dataframe_name', ...) |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3b1e217",
   "metadata": {},
   "source": [
    "For more information on how Woodwork manages typing information, refer to the [Woodwork Understanding Types and Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8453248",
   "metadata": {},
   "source": [
    "### What do these changes mean for users?\n",
    "Removing these classes required moving several methods from the `Entity` to the `EntitySet` object. This change also impacts the way relationships, features and primitives are defined, requiring different parameters than were previously required. Also, because the Woodwork typing system is not identical to the old Featuretools typing system, in some cases the feature matrix that is returned can be slightly different as a result of columns being identified as different types.\n",
    "\n",
    "All of these changes, and more, will be reviewed in detail throughout this document, providing examples of both the old and new API where possible."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de402e3b",
   "metadata": {},
   "source": [
    "## Removal of `Entity` Class and Updates to `EntitySet`\n",
    "\n",
    "In previous versions of Featuretools an EntitySet was created by adding multiple entities and then defining relationships between variables (columns) in different entities. Starting in Featuretools version 1.0, EntitySets are now created by adding multiple dataframes and defining relationships between columns in the dataframes. While conceptually similar, there are some minor differences in the process.\n",
    "\n",
    "### Adding dataframes to an EntitySet\n",
    "\n",
    "When adding dataframes to an EntitySet, users can pass in a Woodwork dataframe or a regular dataframe without Woodwork typing information. If users supply a dataframe that has Woodwork typing information initialized, Featuretools will simply use this typing information directly. If users supply a dataframe without Woodwork initialized, Featuretools will initialize Woodwork on the dataframe, performing type inference for any column that does not have typing information specified.\n",
    "\n",
    "Below are some examples to illustrate this process. First we will create two small dataframes to use for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bea1bd4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import featuretools as ft"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b094ca23",
   "metadata": {},
   "outputs": [],
   "source": [
    "orders_df = pd.DataFrame(\n",
    "    {\"order_id\": [0, 1, 2], \"order_date\": [\"2021-01-02\", \"2021-01-03\", \"2021-01-04\"]}\n",
    ")\n",
    "items_df = pd.DataFrame(\n",
    "    {\n",
    "        \"id\": [0, 1, 2, 3, 4],\n",
    "        \"order_id\": [0, 1, 1, 2, 2],\n",
    "        \"item_price\": [29.95, 4.99, 10.25, 20.50, 15.99],\n",
    "        \"on_sale\": [False, True, False, True, False],\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db705814",
   "metadata": {},
   "source": [
    "With older versions of Featuretools, users would first create an EntitySet object, and then add dataframes to the EntitySet, by calling `entity_from_dataframe` as shown below.\n",
    "\n",
    "```python\n",
    "es = ft.EntitySet('old_es')\n",
    "\n",
    "es.entity_from_dataframe(dataframe=orders_df,\n",
    "                         entity_id='orders',\n",
    "                         index='order_id',\n",
    "                         time_index='order_date')\n",
    "es.entity_from_dataframe(dataframe=items_df,\n",
    "                         entity_id='items',\n",
    "                         index='id')\n",
    "```\n",
    "\n",
    "```\n",
    "Entityset: old_es\n",
    "  Entities:\n",
    "    orders [Rows: 3, Columns: 2]\n",
    "    items [Rows: 5, Columns: 3]\n",
    "  Relationships:\n",
    "    No relationships\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6f95f35",
   "metadata": {},
   "source": [
    "With Featuretools 1.0, the steps for adding a dataframe to an EntitySet are the same, but some of the details have changed. First, create an EntitySet as before. To add the dataframe call `EntitySet.add_dataframe` in place of the previous `EntitySet.entity_from_dataframe` call. Note that the name of the dataframe is specified in the `dataframe_name` argument, which was previously called `entity_id`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1fdffe4",
   "metadata": {},
   "outputs": [],
   "source": [
    "es = ft.EntitySet(\"new_es\")\n",
    "\n",
    "es.add_dataframe(\n",
    "    dataframe=orders_df,\n",
    "    dataframe_name=\"orders\",\n",
    "    index=\"order_id\",\n",
    "    time_index=\"order_date\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c983744",
   "metadata": {},
   "source": [
    "You can also define the name, index, and time index by first [initializing Woodwork](https://woodwork.alteryx.com/en/stable/generated/woodwork.table_accessor.WoodworkTableAccessor.init.html#woodwork.table_accessor.WoodworkTableAccessor.init) on the dataframe and then passing the Woodwork initialized dataframe directly to the `add_dataframe` call. For this example we will initialize Woodwork on `items_df`, setting the dataframe name as `items` and specifying that the index should be the `id` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d5ad8e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "items_df.ww.init(name=\"items\", index=\"id\")\n",
    "items_df.ww"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07f5f27c",
   "metadata": {},
   "source": [
    "With Woodwork initialized, we no longer need to specify values for the `dataframe_name` or `index` arguments when calling `add_dataframe` as Featuretools will simply use the values that were already specified when Woodwork was initialized."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f4ab39a",
   "metadata": {},
   "outputs": [],
   "source": [
    "es.add_dataframe(dataframe=items_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93814387",
   "metadata": {},
   "source": [
    "### Accessing column typing information\n",
    "\n",
    "Previously, column variable type information could be accessed for an entire Entity through `Entity.variable_types` or for an individual column by selecting the individual column first through `es['entity_id']['col_id']`.\n",
    "\n",
    "```python\n",
    "es['items'].variable_types\n",
    "```\n",
    "```\n",
    "{'id': featuretools.variable_types.variable.Index,\n",
    " 'order_id': featuretools.variable_types.variable.Numeric,\n",
    " 'item_price': featuretools.variable_types.variable.Numeric}\n",
    "```\n",
    "```python\n",
    "es['items']['item_price']\n",
    "```\n",
    "```\n",
    "<Variable: item_price (dtype = numeric)>\n",
    "```\n",
    "\n",
    "With the updated version of Featuretools, the logical types and semantic tags for all of the columns in a single dataframe can be viewed through the `.ww` namespace on the dataframe. First, select the dataframe from the EntitySet with `es['dataframe_name']` and then access the typing information by chaining a `.ww` call on the end as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6abb9b10",
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"items\"].ww"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72775903",
   "metadata": {},
   "source": [
    "The logical type and semantic tags for a single column can be obtained from the Woodwork columns dictionary stored on the dataframe, returning a `Woodwork.ColumnSchema` object that stores the typing information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da516642",
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"items\"].ww.columns[\"item_price\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50f9f70a",
   "metadata": {},
   "source": [
    "### Type inference and updating column types\n",
    "\n",
    "Featuretools will attempt to infer types for any columns that do not have types defined by the user. Prior to version 1.0, Featuretools implemented custom type inference code to determine what variable type should be assigned to each column. You could see the inferred variable types by viewing the contents of the `Entity.variable_types` dictionary.\n",
    "\n",
    "Starting in Featuretools 1.0, column type inference is being handled by Woodwork. Any columns that do not have a logical type assigned by the user when adding a dataframe to an EntitySet will have their logical types inferred by Woodwork. As before, type inference can be skipped for any columns in a dataframe by passing the appropriate logical types in a dictionary when calling `EntitySet.add_dataframe`.\n",
    "\n",
    "As an example, we can create a new dataframe and add it to an EntitySet, specifying the logical type for the user's full name as the Woodwork `PersonFullName` logical type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a34016b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "users_df = pd.DataFrame(\n",
    "    {\"id\": [0, 1, 2], \"name\": [\"John Doe\", \"Rita Book\", \"Teri Dactyl\"]}\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d999e022",
   "metadata": {},
   "outputs": [],
   "source": [
    "es.add_dataframe(\n",
    "    dataframe=users_df,\n",
    "    dataframe_name=\"users\",\n",
    "    index=\"id\",\n",
    "    logical_types={\"name\": \"PersonFullName\"},\n",
    ")\n",
    "\n",
    "es[\"users\"].ww"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2eff5e1",
   "metadata": {},
   "source": [
    "Looking at the typing information above, we can see that the logical type for the `name` column was set to `PersonFullName` as we specified.\n",
    "\n",
    "Situations will occur where type inference identifies a column as having the incorrect logical type. In these situations, the logical type can be updated using the Woodwork `set_types` method. Let's say we want the `order_id` column of the `orders` dataframe to have a `Categorical` logical type instead of the `Integer` type that was inferred. Previously, this would have accomplished through the `Entity.convert_variable_type` method.\n",
    "\n",
    "```python\n",
    "from featuretools.variable_types import Categorical\n",
    "\n",
    "es['items'].convert_variable_type(variable_id='order_id', new_type=Categorical)\n",
    "```\n",
    "\n",
    "Now, we can perform this same update using Woodwork:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6c095b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"items\"].ww.set_types(logical_types={\"order_id\": \"Categorical\"})\n",
    "es[\"items\"].ww"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9d84e08",
   "metadata": {},
   "source": [
    "For additional information on Woodwork typing and how it is used in Featuretools, refer to [Woodwork Typing in Featuretools](../getting_started/woodwork_types.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf3dfea2",
   "metadata": {},
   "source": [
    "### Adding interesting values\n",
    "\n",
    "Interesting values can be added to all dataframes in an EntitySet, a single dataframe in an EntitySet, or to a single column of a dataframe in an EntitySet.\n",
    "\n",
    "To add interesting values for all of the dataframes in an EntitySet, simply call `EntitySet.add_interesting_values`, optionally specifying the maximum number of values to add for each column. This remains unchanged from older versions of Featuretools to the 1.0 release.\n",
    "\n",
    "Adding values for a single dataframe or for a single column has changed. Previously to add interesting values for an Entity, users would call `Entity.add_interesting_values()`:\n",
    "```python\n",
    "es['items'].add_interesting_values()\n",
    "```\n",
    "\n",
    "Now, in order to specify interesting values for a single dataframe, you call `add_interesting_values` on the EntitySet, and pass the name of the dataframe for which you want interesting values added:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c058d2ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "es.add_interesting_values(dataframe_name=\"items\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3e0a247",
   "metadata": {},
   "source": [
    "Previously, to manually add interesting values for a column, you would simply assign them to the attribute of the variable:\n",
    "\n",
    "```python\n",
    "es['items']['order_id'].interesting_values = [1, 2]\n",
    "```\n",
    "\n",
    "Now, this is done through `EntitySet.add_interesting_values`, passing in the name of the dataframe and a dictionary mapping column names to the interesting values to assign for that column. For example, to assign the interesting values of `[1, 2]` to the `order_id` column of the `items` dataframe, use the following approach:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8276114b",
   "metadata": {},
   "outputs": [],
   "source": [
    "es.add_interesting_values(dataframe_name=\"items\", values={\"order_id\": [1, 2]})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22e70b84",
   "metadata": {},
   "source": [
    "Interesting values for multiple columns in the same dataframe can be assigned by adding more entries to the dictionary passed to the `values` parameter.\n",
    "\n",
    "Accessing interesting values has changed as well. Previously interesting values could be viewed from the variable:\n",
    "```python\n",
    "es['items']['order_id'].interesting_values\n",
    "```\n",
    "\n",
    "Interesting values are now stored in the Woodwork metadata for the columns in a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8461c4f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "es[\"items\"].ww.columns[\"order_id\"].metadata[\"interesting_values\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb23501f",
   "metadata": {},
   "source": [
    "### Setting a secondary time index\n",
    "\n",
    "In earlier versions of Featuretools, a secondary time index could be set on an Entity by calling `Entity.set_secondary_time_index`. \n",
    "```python\n",
    "es_flight = ft.demo.load_flight(nrows=100)\n",
    "\n",
    "arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',\n",
    "                    'national_airspace_delay', 'security_delay',\n",
    "                    'late_aircraft_delay', 'canceled', 'diverted',\n",
    "                    'taxi_in', 'taxi_out', 'air_time', 'dep_time']\n",
    "es_flight['trip_logs'].set_secondary_time_index({'arr_time': arr_time_columns})\n",
    "```\n",
    "\n",
    "Since the `Entity` class has been removed in Featuretools 1.0, this now needs to be done through the `EntitySet` instead:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b80b1f6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "es_flight = ft.demo.load_flight(nrows=100)\n",
    "\n",
    "arr_time_columns = [\n",
    "    \"arr_delay\",\n",
    "    \"dep_delay\",\n",
    "    \"carrier_delay\",\n",
    "    \"weather_delay\",\n",
    "    \"national_airspace_delay\",\n",
    "    \"security_delay\",\n",
    "    \"late_aircraft_delay\",\n",
    "    \"canceled\",\n",
    "    \"diverted\",\n",
    "    \"taxi_in\",\n",
    "    \"taxi_out\",\n",
    "    \"air_time\",\n",
    "    \"dep_time\",\n",
    "]\n",
    "es_flight.set_secondary_time_index(\n",
    "    dataframe_name=\"trip_logs\", secondary_time_index={\"arr_time\": arr_time_columns}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ebee2e6",
   "metadata": {},
   "source": [
    "Previously, the secondary time index could be accessed directly from the Entity with `es_flight['trip_logs'].secondary_time_index`. Starting in Featuretools 1.0 the secondary time index and the associated columns are stored in the Woodwork dataframe metadata and can be accessed as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ea95fdb",
   "metadata": {},
   "outputs": [],
   "source": [
    "es_flight[\"trip_logs\"].ww.metadata[\"secondary_time_index\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2f9b64c",
   "metadata": {},
   "source": [
    "### Normalizing Entities/DataFrames\n",
    "\n",
    "`EntitySet.normalize_entity` has been renamed to `EntitySet.normalize_dataframe` in Featuretools 1.0. The new method works in the same way as the old method, but some of the parameters have been renamed. The table below shows the old and new names for reference. When calling this method, the new parameter names need to be used.\n",
    "\n",
    "| Old Parameter Name | New Parameter Name |\n",
    "| --- | --- |\n",
    "| base_entity_id | base_dataframe_name |\n",
    "| new_entity_id | new_dataframe_name |\n",
    "| additional_variables | additional_columns |\n",
    "| copy_variables | copy_columns |\n",
    "| new_entity_time_index | new_dataframe_time_index |\n",
    "| new_entity_secondary_time_index | new_dataframe_secondary_time_index |"
   ]
  },
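  {
   "cell_type": "markdown",
   "id": "e5a1c3b7",
   "metadata": {},
   "source": [
    "As a rough illustration of the table above, the sketch below translates keyword arguments written for the old `normalize_entity` call into the names expected by `normalize_dataframe`. The helper `rename_normalize_kwargs` and the example dataframe and column names are hypothetical and are not part of Featuretools; only the parameter names come from the table.\n",
    "\n",
    "```python\n",
    "# Old -> new parameter names, copied from the table above\n",
    "NORMALIZE_RENAMES = {\n",
    "    'base_entity_id': 'base_dataframe_name',\n",
    "    'new_entity_id': 'new_dataframe_name',\n",
    "    'additional_variables': 'additional_columns',\n",
    "    'copy_variables': 'copy_columns',\n",
    "    'new_entity_time_index': 'new_dataframe_time_index',\n",
    "    'new_entity_secondary_time_index': 'new_dataframe_secondary_time_index',\n",
    "}\n",
    "\n",
    "\n",
    "def rename_normalize_kwargs(old_kwargs):\n",
    "    # Hypothetical helper: rename known keys and pass all other keys through unchanged\n",
    "    return {NORMALIZE_RENAMES.get(key, key): value for key, value in old_kwargs.items()}\n",
    "\n",
    "\n",
    "# Illustrative values only; these names do not refer to the EntitySet used in this guide\n",
    "old_kwargs = {\n",
    "    'base_entity_id': 'transactions',\n",
    "    'new_entity_id': 'sessions',\n",
    "    'index': 'session_id',\n",
    "}\n",
    "new_kwargs = rename_normalize_kwargs(old_kwargs)\n",
    "print(new_kwargs)\n",
    "```\n",
    "\n",
    "The resulting dictionary could then be unpacked into the new method, for example `es.normalize_dataframe(**new_kwargs)`, assuming the EntitySet contains a matching base dataframe."
   ]
  },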
  {
   "cell_type": "markdown",
   "id": "ca81708b",
   "metadata": {},
   "source": [
    "### Defining and adding relationships\n",
    "\n",
    "In earlier versions of Featuretools, relationships were defined by creating a `Relationship` object, which took two `Variables` as inputs. To define a relationship between the orders Entity and the items Entity, we would first create a `Relationship` and then add it to the EntitySet:\n",
    "\n",
    "```python\n",
    "relationship = ft.Relationship(es['orders']['order_id'], es['items']['order_id'])\n",
    "es.add_relationship(relationship)\n",
    "```\n",
    "\n",
    "With Featuretools 1.0, the process is similar, but there are two different ways to add the relationship to the EntitySet. One way is to pass the dataframe and column names to `EntitySet.add_relationship`, and another is to pass a previously created `Relationship` object to the `relationship` keyword argument. Both approaches are demonstrated below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d738807",
   "metadata": {
    "nbshpinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Undo change from above and change child column logical type to match parent and prevent warning\n",
    "# NOTE: This cell is hidden in the docs build\n",
    "es[\"items\"].ww.set_types(logical_types={\"order_id\": \"Integer\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97c04dd4",
   "metadata": {},
   "outputs": [],
   "source": [
    "es.add_relationship(\n",
    "    parent_dataframe_name=\"orders\",\n",
    "    parent_column_name=\"order_id\",\n",
    "    child_dataframe_name=\"items\",\n",
    "    child_column_name=\"order_id\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26643d04",
   "metadata": {
    "nbshpinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Reset the relationship so we can add it again\n",
    "# NOTE: This cell is hidden in the docs build\n",
    "es.relationships = []"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "317e5657",
   "metadata": {},
   "source": [
    "Alternatively, we can first create a `Relationship` and pass that to `EntitySet.add_relationship`. When defining a `Relationship` we need to pass in the EntitySet to which it belongs along with the names for the parent dataframe and parent column and the name of the child dataframe and child column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47e54c72",
   "metadata": {},
   "outputs": [],
   "source": [
    "relationship = ft.Relationship(\n",
    "    entityset=es,\n",
    "    parent_dataframe_name=\"orders\",\n",
    "    parent_column_name=\"order_id\",\n",
    "    child_dataframe_name=\"items\",\n",
    "    child_column_name=\"order_id\",\n",
    ")\n",
    "es.add_relationship(relationship=relationship)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a49ba91",
   "metadata": {},
   "source": [
    "### Updating data for a dataframe in an EntitySet\n",
    "\n",
    "Previously to update (replace) the data associated with an Entity, users could call `Entity.update_data` and pass in the new dataframe. As an example, let's update the data in our `users` Entity:\n",
    "```python\n",
    "new_users_df = pd.DataFrame({\n",
    "    'id': [3, 4],\n",
    "    'name': ['Anne Teak', 'Art Decco']\n",
    "})\n",
    "\n",
    "es['users'].update_data(df=new_users_df)\n",
    "```\n",
    "\n",
    "To accomplish this task with Featuretools 1.0, we will use the `EntitySet.replace_dataframe` method instead:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b45a81d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_users_df = pd.DataFrame({\"id\": [0, 1], \"name\": [\"Anne Teak\", \"Art Decco\"]})\n",
    "\n",
    "es.replace_dataframe(dataframe_name=\"users\", df=new_users_df)\n",
    "es[\"users\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "679af861",
   "metadata": {},
   "source": [
    "## Defining features\n",
    "\n",
    "The syntax for defining features has changed slightly in Featuretools 1.0. Previously, identity features could be defined simply by passing in the variable that should be used to build the feature.\n",
    "\n",
    "```python\n",
    "feature = ft.Feature(es['items']['item_price'])\n",
    "```\n",
    "\n",
    "Starting with Featuretools 1.0, a similar syntax can be used, but because `es['items']` will now return a Woodwork dataframe instead of an `Entity`, we need to update the syntax slightly to access the Woodwork column. To update, simply add `.ww` between the dataframe name selector and the column selector as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88902f6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature = ft.Feature(es[\"items\"].ww[\"item_price\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0faf41e4",
   "metadata": {},
   "source": [
    "## Defining primitives\n",
    "\n",
    "In earlier versions of Featuretools, primitive input and return types were defined by specifying the appropriate `Variable` class. Starting in version 1.0, the input and return types are defined by Woodwork `ColumnSchema` objects. \n",
    "\n",
    "To illustrate this change, let's look closer at the `Age` transform primitive. This primitive takes a datetime representing a date of birth and returns a numeric value corresponding to a person's age. In previous versions of Featuretools, the input type was defined by specifying the `DateOfBirth` variable type and the return type was specified by the `Numeric` variable type:\n",
    "\n",
    "```python\n",
    "input_types = [DateOfBirth]\n",
    "return_type = Numeric\n",
    "```\n",
    "\n",
    "Woodwork does not have a specific `DateOfBirth` logical type, but rather identifies a column as a date of birth column by specifying the logical type as `Datetime` with a semantic tag of `date_of_birth`. There is also no `Numeric` logical type in Woodwork, but rather Woodwork identifies all columns that can be used for numeric operations with the semantic tag of `numeric`. Furthermore, we know the `Age` primitive will return a floating point number, which would correspond to a Woodwork logical type of `Double`. With these items in mind, we can redefine the `Age` input types and return types with `ColumnSchema` objects as follows:\n",
    "\n",
    "```python\n",
    "input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})]\n",
    "return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})\n",
    "```\n",
    "\n",
    "Aside from changing the way input and return types are defined, the rest of the process for defining primitives remains unchanged."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebcd6d9e",
   "metadata": {},
   "source": [
    "### Mapping from old Featuretools variable types to Woodwork ColumnSchemas\n",
    "\n",
    "Types defined by Woodwork differ from the old variable types that were defined by Featuretools prior to version 1.0. While there is not a direct mapping from the old variable types to the new Woodwork types defined by `ColumnSchema` objects, the approximate mapping is shown below.\n",
    "\n",
    "\n",
    "| Featuretools Variable | Woodwork Column Schema |\n",
    "| --- | --- |\n",
    "| Boolean | ColumnSchema(logical_type=Boolean) or ColumnSchema(logical_type=BooleanNullable) |\n",
    "| Categorical | ColumnSchema(logical_type=Categorical) |\n",
    "| CountryCode | ColumnSchema(logical_type=CountryCode) |\n",
    "| Datetime | ColumnSchema(logical_type=Datetime) |\n",
    "| DateOfBirth | ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'}) |\n",
    "| DatetimeTimeIndex | ColumnSchema(logical_type=Datetime, semantic_tags={'time_index'}) |\n",
    "| Discrete | ColumnSchema(semantic_tags={'category'}) |\n",
    "| EmailAddress | ColumnSchema(logical_type=EmailAddress) |\n",
    "| FilePath | ColumnSchema(logical_type=Filepath) |\n",
    "| FullName | ColumnSchema(logical_type=PersonFullName) |\n",
    "| Id | ColumnSchema(semantic_tags={'foreign_key'}) |\n",
    "| Index | ColumnSchema(semantic_tags={'index'}) |\n",
    "| IPAddress | ColumnSchema(logical_type=IPAddress) |\n",
    "| LatLong | ColumnSchema(logical_type=LatLong) |\n",
    "| NaturalLanguage | ColumnSchema(logical_type=NaturalLanguage) |\n",
    "| Numeric | ColumnSchema(semantic_tags={'numeric'}) |\n",
    "| NumericTimeIndex | ColumnSchema(semantic_tags={'numeric', 'time_index'}) |\n",
    "| Ordinal | ColumnSchema(logical_type=Ordinal) |\n",
    "| PhoneNumber | ColumnSchema(logical_type=PhoneNumber) |\n",
    "| SubRegionCode | ColumnSchema(logical_type=SubRegionCode) |\n",
    "| Timedelta | ColumnSchema(logical_type=Timedelta) |\n",
    "| TimeIndex | ColumnSchema(semantic_tags={'time_index'}) |\n",
    "| URL | ColumnSchema(logical_type=URL) |\n",
    "| Unknown | ColumnSchema(logical_type=Unknown) |\n",
    "| ZIPCode | ColumnSchema(logical_type=PostalCode) |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fec87370",
   "metadata": {},
   "source": [
    "## Changes to Deep Feature Synthesis and Calculate Feature Matrix\n",
    "\n",
    "The argument names for both `featuretools.dfs` and `featuretools.calculate_feature_matrix` have changed slightly in Featuretools 1.0. In prior versions, users could generate a list of features using the default primitives and options like this:\n",
    "\n",
    "```python\n",
    "features = ft.dfs(entityset=es,\n",
    "                  target_entity='items',\n",
    "                  features_only=True)\n",
    "```\n",
    "\n",
    "In Featuretools 1.0, the `target_entity` argument has been renamed to `target_dataframe_name`, but otherwise this basic call remains the same.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5428949c",
   "metadata": {},
   "outputs": [],
   "source": [
    "features = ft.dfs(entityset=es, target_dataframe_name=\"items\", features_only=True)\n",
    "features"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3154734d",
   "metadata": {},
   "source": [
    "In addition, the `dfs` argument `ignore_entities` was renamed to `ignore_dataframes` and `ignore_variables` was renamed to `ignore_columns`. Similarly, if specifying primitive options, all references to `entities` should be replaced with `dataframes` and references to `variables` should be replaced with columns. For example, the primitive option of `include_groupby_entities` is now `include_groupby_dataframes` and `include_variables` is now `include_columns`.\n",
    "\n",
    "The basic call to `featuretools.calculate_feature_matrix` remains unchanged if passing in an EntitySet along with a list of features to caluculate. However, users calling `calculate_feature_matrix` by passing in a list of `entities` and `relationships` should note that the `entities` argument has been renamed to `dataframes` and the values in the dictionary values should now include Woodwork logical types instead of Featuretools `Variable` classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "456da22e",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es)\n",
    "feature_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b87489cf",
   "metadata": {},
   "source": [
    "In addition to the changes in argument names, there are a couple other changes to the returned feature matrix that users should be aware of. First, because of slight differences in the way Woodwork defines column types compared to how the prior Featuretools implementation did, there can be some differences in the features that are generated between old and new versions. The most notable impact is in the way foreign key columns are handled. Previously, Featuretools treated all foreign key (previously `Id`) columns as categorical columns, and would generate appropriate features from these columns. Starting in version 1.0, foreign key columns are not constrained to be categorical, and if they are another type such as `Integer`, features will not be generated from these columns. Manually converting foreign key columns to `Categorical` as shown above will result in features much closer to those achieved with previous versions.\n",
    "\n",
    "Also, because Woodwork's type inference process differs from the previous Featuretools type inference process, an EntitySet may have column types identified differently. This difference in column types could impact the features that are generated. If it is important to have the same set of features, check all of the logical types in the EntitySet dataframes and update them to the expected types if there are columns that have been inferred as unexpected types.\n",
    "\n",
    "Finally, the feature matrix calculated by Featuretools will now have Woodwork initialized. This means that users can view feature matrix column typing information through the Woodwork namespace as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdb45cc9",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix.ww"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68910d73",
   "metadata": {},
   "source": [
    "Featuretools now labels features by whether they were originally in the dataframes, or whether they were created by Featuretools. This information is stored in the Woodwork `origin` attribute for the column. Columns that were in the original data will be labeled with `base` and features that were created by Featuretools will be labeled with `engineered`.\n",
    "\n",
    "As a demonstration of how to access this information, let's compare two features in the feature matrix: `item_price` and `orders.MEAN(items.item_price)`. `item_price` was present in the original data, and `orders.MEAN(items.item_price)` was created by Featuretools."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3e143fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix.ww[\"item_price\"].ww.origin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12cf8260",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_matrix.ww[\"orders.MEAN(items.item_price)\"].ww.origin"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c429c75",
   "metadata": {},
   "source": [
    "## Other changes\n",
    "\n",
    "In addition to the changes outlined above, there are several other smaller changes in Featuretools 1.0 of which existing users should be aware.\n",
    "\n",
    "- Column ordering of an dataframe in an EntitySet might be different than it was before. Previously, Featuretools would reorder the columns such that the index column would always be the first column in the dataframe. This behavior has been removed, and the index column is no longer guaranteed to be the first column in the dataframe. Now the index column will remain in the position it was when the dataframe was added to the EntitySet.\n",
    "\n",
    "- For `LatLong` columns, older versions of Featuretools would replace single `nan` values in the columns with a tuple `(nan, nan)`. This is no longer the case, and single `nan` values will now remain in the `LatLong` column. Based on the behavior in Woodwork, any values of `(nan, nan)` in a `LatLong` column will be replaced with a single `nan` value.\n",
    "\n",
    "- Since Featuretools no longer defines `Variable` objects with relationships between them, the `featuretools.variable_types.graph_variable_types` function has been removed.\n",
    "\n",
    "- The `featuretools.variable_types.list_variable_types` utility function has been removed and replaced with two corresponding Woodwork functions: `woodwork.list_logical_types` and `woodwork.list_semantic_tags`. Starting in Featuretools 1.0, the Woodwork utility functions should be used to obtain information on the logical types and semantic tags that can be applied to dataframe columns."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docs/source/resources/usage_tips/glossary.rst
================================================
.. _glossary:
.. currentmodule:: featuretools

Glossary
========

.. glossary::
    :sorted:

    feature
        A transformation of data used for machine learning.  Featuretools has a custom language for defining features as described :ref:`here <primitives>`. All features are represented by subclasses of :class:`FeatureBase`.

    feature engineering
        The process of transforming data into representations that are better for machine learning.

    cutoff time
        The last point in time at which data is allowed to be used when calculating a feature.

    EntitySet
        A collection of dataframes and the relationships between them. Represented by the :class:`.EntitySet` class.

    instance
        Equivalent to a row in a relational database. Each dataframe has many instances, and each instance has a value for each column and feature defined on the dataframe.

    target dataframe
        The dataframe for which we will be making features.

    parent dataframe
        A dataframe that is referenced by another dataframe via relationship. The "one" in a one-to-many relationship.

    child dataframe
        A dataframe that references another dataframe via relationship. The "many" in a one-to-many relationship.

    relationship
        A mapping between a parent dataframe and a child dataframe. The child dataframe must contain a column referencing the index column on the parent dataframe. Represented by the :class:`.Relationship` class.

    logical type
        Additional information about how a column should be interpreted or parsed beyond how the data is stored on disk or in memory. Used to determine which primitives can be applied to a column to generate features.

    semantic tag
        Optional additional information on the column about the meaning or potential uses of data. Used to determine which primitives can be applied to a column to generate features.

    ColumnSchema
        All of a Woodwork column's type information including the logical type and any semantic tags.


================================================
FILE: docs/source/resources/usage_tips/limitations.rst
================================================
Limitations
-----------
In-memory
*********

Featuretools is intended to be run on datasets that can fit in memory on one machine. For advice on handling large datasets, refer to :ref:`Improving Computational Performance <performance>`.

Bring your own labels
*********************

If you are doing supervised machine learning, you must supply your own labels and cutoff times. To structure this process, you can use `Compose <https://compose.featurelabs.com>`_, which is an open source project for automatically generating labels with cutoff times.


================================================
FILE: docs/source/set-headers.py
================================================
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [("Testing", "True")]
urllib.request.install_opener(opener)


================================================
FILE: docs/source/setup.py
================================================
import os

import featuretools as ft


def load_feature_plots():
    es = ft.demo.load_mock_customer(return_entityset=True)
    path = os.path.join(
        os.path.dirname(os.path.abspath(__file__)),
        "getting_started/graphs/",
    )
    agg_feat = ft.AggregationFeature(
        ft.IdentityFeature(es["sessions"].ww["session_id"]),
        "customers",
        ft.primitives.Count,
    )
    trans_feat = ft.TransformFeature(
        ft.IdentityFeature(es["customers"].ww["join_date"]),
        ft.primitives.TimeSincePrevious,
    )
    demo_feat = ft.AggregationFeature(
        ft.TransformFeature(
            ft.IdentityFeature(es["transactions"].ww["transaction_time"]),
            ft.primitives.Weekday,
        ),
        "sessions",
        ft.primitives.Mode,
    )
    ft.graph_feature(agg_feat, to_file=os.path.join(path, "agg_feat.dot"))
    ft.graph_feature(trans_feat, to_file=os.path.join(path, "trans_feat.dot"))
    ft.graph_feature(demo_feat, to_file=os.path.join(path, "demo_feat.dot"))


if __name__ == "__main__":
    load_feature_plots()


================================================
FILE: docs/source/templates/layout.html
================================================
{% extends "!layout.html" %}

{%- block extrahead %}


{% set image = 'https://alteryx-oss-web-images.s3.amazonaws.com/OpenSource_OpenGraph_1200x630px-featuretools.png' %}
{% set description = 'Automated feature engineering in Python' %}
{% if meta is defined %}
    {% if meta.description is defined %}
        {% set description = meta.description %}
    {% endif %}
{% endif %}

<meta property="og:title" content="{{ title|striptags|e }}{{ titlesuffix }}">
<meta name="description" content="{{description}}" />
<meta property="og:description" content="{{description}}">
<meta property="og:image" content="{{image}}">
<meta property="twitter:image" content="{{image}}">
<meta name="twitter:card" content="summary_large_image">


{% endblock %}

{%- block footer %}

<footer class="footer">
  <div class="footer-container">
    <div class="footer-cell-1">
      <img class="footer-image-alteryx" src="{{ pathto('_static/images/alteryx_open_source.svg', 1) }}" alt="Alteryx Open Source">
    </div>
    <div class="footer-cell-2">
      <a href="https://github.com/alteryx/featuretools#readme" target="_blank">
        <img  class="footer-image-github" src="{{ pathto('_static/images/github.svg', 1) }}" alt="GitHub">
      </a>
      <a href="https://twitter.com/AlteryxOSS" target="_blank">
        <img  class="footer-image-twitter" src="{{ pathto('_static/images/twitter.svg', 1) }}" alt="Twitter">
      </a>
      <a href="https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA" target="_blank">
        <img  class="footer-image-github" src="{{ pathto('_static/images/slack.svg', 1) }}" alt="Slack">
      </a>
      <a href="https://stackoverflow.com/questions/tagged/featuretools" target="_blank">
        <img  class="footer-image-github" src="{{ pathto('_static/images/stackoverflow.svg', 1) }}" alt="Stack Overflow">
      </a>
    </div>
    <div class="footer-cell-3">
      <hr class="footer-line">
    </div>
    <div class="footer-cell-4">
      <img class="footer-image-copyright" src="{{ pathto('_static/images/copyright.svg', 1) }}" alt="Copyright">
    </div>
  </div>
</footer>

{% endblock %}


================================================
FILE: featuretools/__init__.py
================================================
# flake8: noqa
from featuretools.version import __version__
from featuretools.config_init import config
from featuretools.entityset.api import *
from featuretools import primitives
from featuretools.synthesis.api import *
from featuretools.primitives import list_primitives, summarize_primitives
from featuretools.computational_backends.api import *
from featuretools import tests
from featuretools.utils.recommend_primitives import get_recommended_primitives
from featuretools.utils.time_utils import *
from featuretools.utils.utils_info import show_info
import featuretools.demo
from featuretools import feature_base
from featuretools import selection
from featuretools.feature_base import (
    AggregationFeature,
    DirectFeature,
    Feature,
    FeatureBase,
    GroupByTransformFeature,
    IdentityFeature,
    TransformFeature,
    graph_feature,
    describe_feature,
    save_features,
    load_features,
)

import logging
import pkg_resources
import sys
import traceback
import warnings
from woodwork import list_logical_types, list_semantic_tags

logger = logging.getLogger("featuretools")

# Call functions registered by other libraries when featuretools is imported
for entry_point in pkg_resources.iter_entry_points("featuretools_initialize"):
    try:
        method = entry_point.load()
        if callable(method):
            method()
    except Exception:
        pass
for entry_point in pkg_resources.iter_entry_points("alteryx_open_src_initialize"):
    try:
        method = entry_point.load()
        if callable(method):
            method("featuretools")
    except Exception:
        pass

# Load in submodules registered by other libraries into Featuretools namespace
for entry_point in pkg_resources.iter_entry_points("featuretools_plugin"):
    try:
        sys.modules["featuretools." + entry_point.name] = entry_point.load()
    except Exception:
        message = "Featuretools failed to load plugin {} from library {}. "
        message += "For a full stack trace, set logging to debug."
        logger.warning(message.format(entry_point.name, entry_point.module_name))
        logger.debug(traceback.format_exc())


================================================
FILE: featuretools/__main__.py
================================================


================================================
FILE: featuretools/computational_backends/__init__.py
================================================
# flake8: noqa
from featuretools.computational_backends.api import *


================================================
FILE: featuretools/computational_backends/api.py
================================================
# flake8: noqa
from featuretools.computational_backends.calculate_feature_matrix import (
    approximate_features,
    calculate_feature_matrix,
)
from featuretools.computational_backends.utils import (
    bin_cutoff_times,
    create_client_and_cluster,
    replace_inf_values,
)


================================================
FILE: featuretools/computational_backends/calculate_feature_matrix.py
================================================
import logging
import math
import os
import shutil
import time
import warnings
from datetime import datetime

import cloudpickle
import numpy as np
import pandas as pd
from woodwork.logical_types import (
    Age,
    AgeNullable,
    Boolean,
    BooleanNullable,
    Integer,
    IntegerNullable,
)

from featuretools.computational_backends.feature_set import FeatureSet
from featuretools.computational_backends.feature_set_calculator import (
    FeatureSetCalculator,
)
from featuretools.computational_backends.utils import (
    _check_cutoff_time_type,
    _validate_cutoff_time,
    bin_cutoff_times,
    create_client_and_cluster,
    gather_approximate_features,
    gen_empty_approx_features_df,
    get_ww_types_from_features,
    save_csv_decorator,
)
from featuretools.entityset.relationship import RelationshipPath
from featuretools.feature_base import AggregationFeature, FeatureBase
from featuretools.utils import Trie
from featuretools.utils.gen_utils import (
    import_or_raise,
    make_tqdm_iterator,
)

logger = logging.getLogger("featuretools.computational_backend")

PBAR_FORMAT = "Elapsed: {elapsed} | Progress: {l_bar}{bar}"
FEATURE_CALCULATION_PERCENTAGE = (
    0.95  # make total 5% higher to allot time for wrapping up at end
)


def calculate_feature_matrix(
    features,
    entityset=None,
    cutoff_time=None,
    instance_ids=None,
    dataframes=None,
    relationships=None,
    cutoff_time_in_index=False,
    training_window=None,
    approximate=None,
    save_progress=None,
    verbose=False,
    chunk_size=None,
    n_jobs=1,
    dask_kwargs=None,
    progress_callback=None,
    include_cutoff_time=True,
):
    """Calculates a matrix for a given set of instance ids and calculation times.

    Args:
        features (list[:class:`.FeatureBase`]): Feature definitions to be calculated.

        entityset (EntitySet): An already initialized entityset. Required if `dataframes` and `relationships`
            are not provided.

        cutoff_time (pd.DataFrame or Datetime): Specifies times at which to calculate
            the features for each instance. The resulting feature matrix will use data
            up to and including the cutoff_time. Can either be a DataFrame or a single
            value. If a DataFrame is passed the instance ids for which to calculate features
            must be in a column with the same name as the target dataframe index or a column
            named `instance_id`. The cutoff time values in the DataFrame must be in a column with
            the same name as the target dataframe time index or a column named `time`. If the
            DataFrame has more than two columns, any additional columns will be added to the
            resulting feature matrix. If a single value is passed, this value will be used for
            all instances.

        instance_ids (list): List of instances to calculate features on. Only
            used if cutoff_time is a single datetime.

        dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]):
            Dictionary of DataFrames. Entries take the format
            {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}.
            Note that only the dataframe is required. If a Woodwork DataFrame is supplied, any other parameters
            will be ignored.

        relationships (list[(str, str, str, str)]): list of relationships
            between dataframes. List items are a tuple with the format
            (parent dataframe name, parent column, child dataframe name, child column).

        cutoff_time_in_index (bool): If True, return a DataFrame with a MultiIndex
            where the second index is the cutoff time (first is instance id).
            DataFrame will be sorted by (time, instance_id).

        training_window (Timedelta or str, optional):
            Window defining how much time before the cutoff time data
            can be used when calculating features. If ``None``, all data before cutoff time is used.
            Defaults to ``None``.

        approximate (Timedelta or str): Frequency to group instances with similar
            cutoff times by for features with costly calculations. For example,
            if bucket is 24 hours, all instances with cutoff times on the same
            day will use the same calculation for expensive features.

        verbose (bool, optional): Print progress info. The time granularity is
            per chunk.

        chunk_size (int or float or None): Maximum number of rows of the
            output feature matrix to calculate at a time. If passed an integer
            greater than 0, will try to use that many rows per chunk. If passed
            a float value between 0 and 1, sets the chunk size to that
            fraction of all rows. If None and n_jobs > 1, it will be set to 1/n_jobs.

        n_jobs (int, optional): number of parallel processes to use when
            calculating feature matrix. Requires Dask if not equal to 1.

        dask_kwargs (dict, optional): Dictionary of keyword arguments to be
            passed when creating the dask client and scheduler. Even if n_jobs
            is not set, using `dask_kwargs` will enable multiprocessing.
            Main parameters:

            cluster (str or dask.distributed.LocalCluster):
                cluster or address of cluster to send tasks to. If unspecified,
                a cluster will be created.
            diagnostics_port (int):
                port number to use for web dashboard.  If left unspecified, web
                interface will not be enabled.

            Valid keyword arguments for LocalCluster will also be accepted.

        save_progress (str, optional): path to save intermediate computational results.

        progress_callback (callable): function to be called with incremental progress updates.
            Has the following parameters:

                update: percentage change (float between 0 and 100) in progress since last call
                progress_percent: percentage (float between 0 and 100) of total computation completed
                time_elapsed: total time in seconds that has elapsed since start of call

        include_cutoff_time (bool): Include data at cutoff times in feature calculations. Defaults to ``True``.

    Returns:
        pd.DataFrame: The feature matrix.
    """
    assert (
        isinstance(features, list)
        and features != []
        and all([isinstance(feature, FeatureBase) for feature in features])
    ), "features must be a non-empty list of features"

    # handle loading entityset
    from featuretools.entityset.entityset import EntitySet

    if not isinstance(entityset, EntitySet):
        if dataframes is not None:
            entityset = EntitySet("entityset", dataframes, relationships)
        else:
            raise TypeError("No dataframes or valid EntitySet provided")

    target_dataframe = entityset[features[0].dataframe_name]

    cutoff_time = _validate_cutoff_time(cutoff_time, target_dataframe)
    entityset._check_time_indexes()

    if isinstance(cutoff_time, pd.DataFrame):
        if instance_ids:
            msg = "Passing 'instance_ids' is valid only if 'cutoff_time' is a single value or None - ignoring"
            warnings.warn(msg)
        pass_columns = [
            col for col in cutoff_time.columns if col not in ["instance_id", "time"]
        ]
        # make sure dtype of instance_id in cutoff time
        # is same as column it references
        target_dataframe = features[0].dataframe
        ltype = target_dataframe.ww.logical_types[target_dataframe.ww.index]
        cutoff_time.ww.init(logical_types={"instance_id": ltype})
    else:
        pass_columns = []
        if cutoff_time is None:
            if entityset.time_type == "numeric":
                cutoff_time = np.inf
            else:
                cutoff_time = datetime.now()

        if instance_ids is None:
            index_col = target_dataframe.ww.index
            df = entityset._handle_time(
                dataframe_name=target_dataframe.ww.name,
                df=target_dataframe,
                time_last=cutoff_time,
                training_window=training_window,
                include_cutoff_time=include_cutoff_time,
            )
            instance_ids = df[index_col]

        # convert list or range object into series
        if not isinstance(instance_ids, pd.Series):
            instance_ids = pd.Series(instance_ids)

        cutoff_time = (cutoff_time, instance_ids)

    _check_cutoff_time_type(cutoff_time, entityset.time_type)

    # Approximate provides no benefit with a single cutoff time, so ignore it
    if isinstance(cutoff_time, tuple) and approximate is not None:
        msg = (
            "Using approximate with a single cutoff_time value or no cutoff_time "
            "provides no computational efficiency benefit"
        )
        warnings.warn(msg)
        cutoff_time = pd.DataFrame(
            {
                "instance_id": cutoff_time[1],
                "time": [cutoff_time[0]] * len(cutoff_time[1]),
            },
        )
        target_dataframe = features[0].dataframe
        ltype = target_dataframe.ww.logical_types[target_dataframe.ww.index]
        cutoff_time.ww.init(logical_types={"instance_id": ltype})

    feature_set = FeatureSet(features)

    # Get features to approximate
    if approximate is not None:
        approximate_feature_trie = gather_approximate_features(feature_set)
        # Make a new FeatureSet that ignores approximated features
        feature_set = FeatureSet(
            features,
            approximate_feature_trie=approximate_feature_trie,
        )

    # Check if there are any non-approximated aggregation features
    no_unapproximated_aggs = True
    for feature in features:
        if isinstance(feature, AggregationFeature):
            # do not need to check if feature is in to_approximate since
            # only base features of direct features can be in to_approximate
            no_unapproximated_aggs = False
            break

        if approximate is not None:
            all_approx_features = {
                f for _, feats in feature_set.approximate_feature_trie for f in feats
            }
        else:
            all_approx_features = set()
        deps = feature.get_dependencies(deep=True, ignored=all_approx_features)
        for dependency in deps:
            if isinstance(dependency, AggregationFeature):
                no_unapproximated_aggs = False
                break

    cutoff_df_time_col = "time"
    target_time = "_original_time"

    if approximate is not None:
        # If there are approximated aggs, bin times
        binned_cutoff_time = bin_cutoff_times(cutoff_time, approximate)

        # Think about collisions: what if original time is a feature
        binned_cutoff_time.ww[target_time] = cutoff_time[cutoff_df_time_col]

        cutoff_time_to_pass = binned_cutoff_time

    else:
        cutoff_time_to_pass = cutoff_time

    if isinstance(cutoff_time, pd.DataFrame):
        cutoff_time_len = cutoff_time.shape[0]
    else:
        cutoff_time_len = len(cutoff_time[1])

    chunk_size = _handle_chunk_size(chunk_size, cutoff_time_len)
    tqdm_options = {
        "total": (cutoff_time_len / FEATURE_CALCULATION_PERCENTAGE),
        "bar_format": PBAR_FORMAT,
        "disable": True,
    }

    if verbose:
        tqdm_options.update({"disable": False})
    elif progress_callback is not None:
        # allows us to utilize progress_bar updates without printing to anywhere
        tqdm_options.update({"file": open(os.devnull, "w"), "disable": False})

    with make_tqdm_iterator(**tqdm_options) as progress_bar:
        if n_jobs != 1 or dask_kwargs is not None:
            feature_matrix = parallel_calculate_chunks(
                cutoff_time=cutoff_time_to_pass,
                chunk_size=chunk_size,
                feature_set=feature_set,
                approximate=approximate,
                training_window=training_window,
                save_progress=save_progress,
                entityset=entityset,
                n_jobs=n_jobs,
                no_unapproximated_aggs=no_unapproximated_aggs,
                cutoff_df_time_col=cutoff_df_time_col,
                target_time=target_time,
                pass_columns=pass_columns,
                progress_bar=progress_bar,
                dask_kwargs=dask_kwargs or {},
                progress_callback=progress_callback,
                include_cutoff_time=include_cutoff_time,
            )
        else:
            feature_matrix = calculate_chunk(
                cutoff_time=cutoff_time_to_pass,
                chunk_size=chunk_size,
                feature_set=feature_set,
                approximate=approximate,
                training_window=training_window,
                save_progress=save_progress,
                entityset=entityset,
                no_unapproximated_aggs=no_unapproximated_aggs,
                cutoff_df_time_col=cutoff_df_time_col,
                target_time=target_time,
                pass_columns=pass_columns,
                progress_bar=progress_bar,
                progress_callback=progress_callback,
                include_cutoff_time=include_cutoff_time,
            )

        # ensure rows are sorted by input order
        if isinstance(cutoff_time, pd.DataFrame):
            feature_matrix = feature_matrix.ww.reindex(
                pd.MultiIndex.from_frame(
                    cutoff_time[["instance_id", "time"]],
                    names=feature_matrix.index.names,
                ),
            )
        else:
            # Maintain index dtype
            index_dtype = feature_matrix.index.get_level_values(0).dtype
            feature_matrix = feature_matrix.ww.reindex(
                cutoff_time[1].astype(index_dtype),
                level=0,
            )
        if not cutoff_time_in_index:
            feature_matrix.ww.reset_index(level="time", drop=True, inplace=True)

        if save_progress and os.path.exists(os.path.join(save_progress, "temp")):
            shutil.rmtree(os.path.join(save_progress, "temp"))

        # force to 100% since we saved last 5 percent
        previous_progress = progress_bar.n
        progress_bar.update(progress_bar.total - progress_bar.n)

        if progress_callback is not None:
            (
                update,
                progress_percent,
                time_elapsed,
            ) = update_progress_callback_parameters(progress_bar, previous_progress)
            progress_callback(update, progress_percent, time_elapsed)

        progress_bar.refresh()

    return feature_matrix


def calculate_chunk(
    cutoff_time,
    chunk_size,
    feature_set,
    entityset,
    approximate,
    training_window,
    save_progress,
    no_unapproximated_aggs,
    cutoff_df_time_col,
    target_time,
    pass_columns,
    progress_bar=None,
    progress_callback=None,
    include_cutoff_time=True,
    schema=None,
):
    if not isinstance(feature_set, FeatureSet):
        feature_set = cloudpickle.loads(feature_set)  # pragma: no cover

    feature_matrix = []
    if no_unapproximated_aggs and approximate is not None:
        if entityset.time_type == "numeric":
            group_time = np.inf
        else:
            group_time = datetime.now()

    if isinstance(cutoff_time, tuple):
        update_progress_callback = None
        if progress_bar is not None:

            def update_progress_callback(done):
                previous_progress = progress_bar.n
                progress_bar.update(done * len(cutoff_time[1]))
                if progress_callback is not None:
                    (
                        update,
                        progress_percent,
                        time_elapsed,
                    ) = update_progress_callback_parameters(
                        progress_bar,
                        previous_progress,
                    )
                    progress_callback(update, progress_percent, time_elapsed)

        time_last = cutoff_time[0]
        ids = cutoff_time[1]
        calculator = FeatureSetCalculator(
            entityset,
            feature_set,
            time_last,
            training_window=training_window,
        )
        _feature_matrix = calculator.run(
            ids,
            progress_callback=update_progress_callback,
            include_cutoff_time=include_cutoff_time,
        )
        time_index = pd.Index([time_last] * len(ids), name="time")
        _feature_matrix = _feature_matrix.set_index(time_index, append=True)
        feature_matrix.append(_feature_matrix)

    else:
        if schema:
            cutoff_time.ww.init_with_full_schema(schema=schema)  # pragma: no cover
        for _, group in cutoff_time.groupby(cutoff_df_time_col):
            # if approximating, calculate the approximate features
            if approximate is not None:
                group.ww.init(schema=cutoff_time.ww.schema, validate=False)
                precalculated_features_trie = approximate_features(
                    feature_set,
                    group,
                    window=approximate,
                    entityset=entityset,
                    training_window=training_window,
                    include_cutoff_time=include_cutoff_time,
                )
            else:
                precalculated_features_trie = None

            @save_csv_decorator(save_progress)
            def calc_results(
                time_last,
                ids,
                precalculated_features=None,
                training_window=None,
                include_cutoff_time=True,
            ):
                update_progress_callback = None

                if progress_bar is not None:

                    def update_progress_callback(done):
                        previous_progress = progress_bar.n
                        progress_bar.update(done * group.shape[0])
                        if progress_callback is not None:
                            (
                                update,
                                progress_percent,
                                time_elapsed,
                            ) = update_progress_callback_parameters(
                                progress_bar,
                                previous_progress,
                            )
                            progress_callback(update, progress_percent, time_elapsed)

                calculator = FeatureSetCalculator(
                    entityset,
                    feature_set,
                    time_last,
                    training_window=training_window,
                    precalculated_features=precalculated_features,
                )
                matrix = calculator.run(
                    ids,
                    progress_callback=update_progress_callback,
                    include_cutoff_time=include_cutoff_time,
                )

                return matrix

            # if all aggregations have been approximated, can calculate all together
            if no_unapproximated_aggs and approximate is not None:
                inner_grouped = [[group_time, group]]
            else:
                # if approximated features, set cutoff_time to unbinned time
                if precalculated_features_trie is not None:
                    group[cutoff_df_time_col] = group[target_time]

                inner_grouped = group.groupby(cutoff_df_time_col, sort=True)

            if chunk_size is not None:
                inner_grouped = _chunk_dataframe_groups(inner_grouped, chunk_size)

            for time_last, group in inner_grouped:
                # sort group by instance id
                ids = group["instance_id"].sort_values().values
                if no_unapproximated_aggs and approximate is not None:
                    window = None
                else:
                    window = training_window

                # calculate values for those instances at time time_last
                _feature_matrix = calc_results(
                    time_last,
                    ids,
                    precalculated_features=precalculated_features_trie,
                    training_window=window,
                    include_cutoff_time=include_cutoff_time,
                )

                id_name = _feature_matrix.index.name

                # if approximate, merge feature matrix with group frame to get original
                # cutoff times and passed columns
                if approximate:
                    cols = [c for c in _feature_matrix.columns if c not in pass_columns]
                    indexer = group[["instance_id", target_time] + pass_columns]
                    _feature_matrix = _feature_matrix[cols].merge(
                        indexer,
                        right_on=["instance_id"],
                        left_index=True,
                        how="right",
                    )
                    _feature_matrix.set_index(
                        ["instance_id", target_time],
                        inplace=True,
                    )
                    _feature_matrix.index.set_names([id_name, "time"], inplace=True)
                    _feature_matrix.sort_index(level=1, kind="mergesort", inplace=True)
                else:
                    # all rows have same cutoff time. set time and add passed columns
                    num_rows = len(ids)
                    if len(pass_columns) > 0:
                        pass_through = group[
                            ["instance_id", cutoff_df_time_col] + pass_columns
                        ]
                        pass_through.rename(
                            columns={
                                "instance_id": id_name,
                                cutoff_df_time_col: "time",
                            },
                            inplace=True,
                        )

                    time_index = pd.Index([time_last] * num_rows, name="time")
                    _feature_matrix = _feature_matrix.set_index(
                        time_index,
                        append=True,
                    )
                    if len(pass_columns) > 0:
                        pass_through.set_index([id_name, "time"], inplace=True)
                        for col in pass_columns:
                            _feature_matrix[col] = pass_through[col]
                feature_matrix.append(_feature_matrix)

    ww_init_kwargs = get_ww_types_from_features(
        feature_set.target_features,
        entityset,
        pass_columns,
        cutoff_time,
    )
    feature_matrix = init_ww_and_concat_fm(feature_matrix, ww_init_kwargs)
    return feature_matrix


def approximate_features(
    feature_set,
    cutoff_time,
    window,
    entityset,
    training_window=None,
    include_cutoff_time=True,
):
    """Given a set of features and cutoff_times to be passed to
    calculate_feature_matrix, calculates approximate values of some features
    to speed up calculations.  Cutoff times are sorted into
    window-sized buckets and the approximate feature values are only calculated
    at one cutoff time for each bucket.


    .. note:: This only approximates DirectFeatures of AggregationFeatures on
        the target dataframe. In future versions, it may also be possible to
        approximate these features on other top-level dataframes.

    Args:
        cutoff_time (pd.DataFrame): Specifies the time at which to calculate
            the features for each instance. The resulting feature matrix will use data
            up to and including the cutoff_time. Must be a DataFrame with
            'instance_id' and 'time' columns.

        window (Timedelta or str): frequency to group instances with similar
            cutoff times by for features with costly calculations. For example,
            if bucket is 24 hours, all instances with cutoff times on the same
            day will use the same calculation for expensive features.

        entityset (:class:`.EntitySet`): An already initialized entityset.

        feature_set (:class:`.FeatureSet`): The features to be calculated.

        training_window (`Timedelta`, optional):
            Window defining how much older than the cutoff time data
            can be to be included when calculating the feature. If None, all older data is used.

        include_cutoff_time (bool):
            If True, data at cutoff times are included in feature calculations.

    """
    approx_fms_trie = Trie(path_constructor=RelationshipPath)

    target_time_colname = "target_time"
    cutoff_time.ww[target_time_colname] = cutoff_time["time"]
    approx_cutoffs = bin_cutoff_times(cutoff_time, window)
    cutoff_df_time_col = "time"
    cutoff_df_instance_col = "instance_id"
    # should this order be by dependencies so that calculate_feature_matrix
    # doesn't skip approximating something?
    for relationship_path, approx_feature_names in feature_set.approximate_feature_trie:
        if not approx_feature_names:
            continue

        (
            cutoffs_with_approx_e_ids,
            new_approx_dataframe_index_col,
        ) = _add_approx_dataframe_index_col(
            entityset,
            feature_set.target_df_name,
            approx_cutoffs.copy(),
            relationship_path,
        )

        # Select only columns we care about
        columns_we_want = [
            new_approx_dataframe_index_col,
            cutoff_df_time_col,
            target_time_colname,
        ]

        cutoffs_with_approx_e_ids = cutoffs_with_approx_e_ids[columns_we_want]
        cutoffs_with_approx_e_ids = cutoffs_with_approx_e_ids.drop_duplicates()
        cutoffs_with_approx_e_ids.dropna(
            subset=[new_approx_dataframe_index_col],
            inplace=True,
        )

        approx_features = [
            feature_set.features_by_name[name] for name in approx_feature_names
        ]
        if cutoffs_with_approx_e_ids.empty:
            approx_fm = gen_empty_approx_features_df(approx_features)
        else:
            cutoffs_with_approx_e_ids.sort_values(
                [cutoff_df_time_col, new_approx_dataframe_index_col],
                inplace=True,
            )
            # CFM assumes specific column names for cutoff_time argument
            rename = {new_approx_dataframe_index_col: cutoff_df_instance_col}
            cutoff_time_to_pass = cutoffs_with_approx_e_ids.rename(columns=rename)
            cutoff_time_to_pass = cutoff_time_to_pass[
                [cutoff_df_instance_col, cutoff_df_time_col]
            ]

            cutoff_time_to_pass.drop_duplicates(inplace=True)
            approx_fm = calculate_feature_matrix(
                approx_features,
                entityset,
                cutoff_time=cutoff_time_to_pass,
                training_window=training_window,
                approximate=None,
                cutoff_time_in_index=False,
                chunk_size=cutoff_time_to_pass.shape[0],
                include_cutoff_time=include_cutoff_time,
            )

        approx_fms_trie.get_node(relationship_path).value = approx_fm

    return approx_fms_trie


def scatter_warning(num_scattered_workers, num_workers):
    if num_scattered_workers != num_workers:
        # Use a distinct local name so the enclosing function is not shadowed.
        msg = "EntitySet was only scattered to {} out of {} workers"
        logger.warning(msg.format(num_scattered_workers, num_workers))


def parallel_calculate_chunks(
    cutoff_time,
    chunk_size,
    feature_set,
    approximate,
    training_window,
    save_progress,
    entityset,
    n_jobs,
    no_unapproximated_aggs,
    cutoff_df_time_col,
    target_time,
    pass_columns,
    progress_bar,
    dask_kwargs=None,
    progress_callback=None,
    include_cutoff_time=True,
):
    import_or_raise(
        "distributed",
        "Dask must be installed to calculate feature matrix with n_jobs set to anything but 1",
    )
    from dask.base import tokenize
    from distributed import Future, as_completed

    client = None
    cluster = None
    try:
        client, cluster = create_client_and_cluster(
            n_jobs=n_jobs,
            dask_kwargs=dask_kwargs,
            entityset_size=entityset.__sizeof__(),
        )
        # scatter the entityset
        # denote future with leading underscore
        start = time.time()
        es_token = "EntitySet-{}".format(tokenize(entityset))
        if es_token in client.list_datasets():
            msg = "Using EntitySet persisted on the cluster as dataset {}"
            progress_bar.write(msg.format(es_token))
            _es = client.get_dataset(es_token)
        else:
            _es = client.scatter([entityset])[0]
            client.publish_dataset(**{_es.key: _es})

        # pickle the feature set and scatter it
        pickled_feats = cloudpickle.dumps(feature_set)
        _saved_features = client.scatter(pickled_feats)
        client.replicate([_es, _saved_features])
        num_scattered_workers = len(
            client.who_has([Future(es_token)]).get(es_token, []),
        )
        num_workers = len(client.scheduler_info()["workers"].values())

        schema = None
        if isinstance(cutoff_time, pd.DataFrame):
            schema = cutoff_time.ww.schema
            chunks = cutoff_time.groupby(cutoff_df_time_col)
            cutoff_time_len = cutoff_time.shape[0]
        else:
            chunks = cutoff_time
            cutoff_time_len = len(cutoff_time[1])

        if not chunk_size:
            chunk_size = _handle_chunk_size(1.0 / num_workers, cutoff_time_len)

        chunks = _chunk_dataframe_groups(chunks, chunk_size)

        chunks = [df for _, df in chunks]

        if len(chunks) < num_workers:  # pragma: no cover
            chunk_warning = (
                "Fewer chunks ({}) than workers ({}); consider reducing the chunk size"
            )
            warning_string = chunk_warning.format(len(chunks), num_workers)
            progress_bar.write(warning_string)

        scatter_warning(num_scattered_workers, num_workers)
        end = time.time()
        scatter_time = round(end - start)

        # if enabled, reset timer after scatter for better time remaining estimates
        if not progress_bar.disable:
            progress_bar.reset()

        scatter_string = "EntitySet scattered to {} workers in {} seconds"
        progress_bar.write(scatter_string.format(num_scattered_workers, scatter_time))
        # map chunks
        # TODO: consider handling task submission dask kwargs
        _chunks = client.map(
            calculate_chunk,
            chunks,
            feature_set=_saved_features,
            chunk_size=None,
            entityset=_es,
            approximate=approximate,
            training_window=training_window,
            save_progress=save_progress,
            no_unapproximated_aggs=no_unapproximated_aggs,
            cutoff_df_time_col=cutoff_df_time_col,
            target_time=target_time,
            pass_columns=pass_columns,
            progress_bar=None,
            progress_callback=progress_callback,
            include_cutoff_time=include_cutoff_time,
            schema=schema,
        )

        feature_matrix = []
        iterator = as_completed(_chunks).batches()
        for batch in iterator:
            results = client.gather(batch)
            for result in results:
                feature_matrix.append(result)
                previous_progress = progress_bar.n
                progress_bar.update(result.shape[0])
                if progress_callback is not None:
                    (
                        update,
                        progress_percent,
                        time_elapsed,
                    ) = update_progress_callback_parameters(
                        progress_bar,
                        previous_progress,
                    )
                    progress_callback(update, progress_percent, time_elapsed)

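    finally:
        if client is not None:
            client.close()

        # A cluster supplied through dask_kwargs is left open for the caller.
        # dask_kwargs defaults to None, so guard the membership test.
        if cluster is not None and "cluster" not in (dask_kwargs or {}):
            cluster.close()  # pragma: no cover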

    ww_init_kwargs = get_ww_types_from_features(
        feature_set.target_features,
        entityset,
        pass_columns,
        cutoff_time,
    )
    feature_matrix = init_ww_and_concat_fm(feature_matrix, ww_init_kwargs)
    return feature_matrix


def _add_approx_dataframe_index_col(es, target_dataframe_name, cutoffs, path):
    """
    Add a column to the cutoff df linking it to the dataframe at the end of the
    path.

    Return the updated cutoff df and the name of this column. The name will
    consist of the columns which were joined through.
    """
    last_child_col = "instance_id"
    last_parent_col = es[target_dataframe_name].ww.index

    for _, relationship in path:
        child_cols = [last_parent_col, relationship._child_column_name]
        child_df = es[relationship.child_name][child_cols]

        # Rename relationship.child_column to include the columns we have
        # joined through.
        new_col_name = "%s.%s" % (last_child_col, relationship._child_column_name)
        to_rename = {relationship._child_column_name: new_col_name}
        child_df = child_df.rename(columns=to_rename)
        cutoffs = cutoffs.merge(
            child_df,
            left_on=last_child_col,
            right_on=last_parent_col,
        )

        # These will be used in the next iteration.
        last_child_col = new_col_name
        last_parent_col = relationship._parent_column_name

    return cutoffs, new_col_name


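def _chunk_dataframe_groups(grouped, chunk_size):
    """Yield (group_key, chunk) pairs with each chunk no larger than chunk_size.

    ``grouped`` is either an iterable of (group_key, dataframe) pairs, such as a
    pandas groupby object, or a tuple whose second item is a dataframe. In the
    tuple case the key is None and the first item is passed through unchanged
    alongside each slice.
    """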
    if isinstance(grouped, tuple):
        for i in range(0, len(grouped[1]), chunk_size):
            yield None, (grouped[0], grouped[1].iloc[i : i + chunk_size])
    else:
        for group_key, group_df in grouped:
            for i in range(0, len(group_df), chunk_size):
                yield group_key, group_df.iloc[i : i + chunk_size]
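

# Illustrative sketch (not part of the featuretools API): the slicing pattern used
# by `_chunk_dataframe_groups`, shown with plain lists instead of dataframes so it
# runs without pandas. `_sketch_chunk_groups` is a hypothetical name.
def _sketch_chunk_groups(groups, chunk_size):
    for group_key, rows in groups:
        # Step through each group in strides of chunk_size; the last slice of a
        # group may be shorter, and chunks never span two groups.
        for i in range(0, len(rows), chunk_size):
            yield group_key, rows[i : i + chunk_size]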


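def _handle_chunk_size(chunk_size, total_size):
    """Resolve chunk_size to a whole number of rows.

    A value between 0 and 1 is treated as a fraction of total_size and rounded
    up; larger values are truncated to an int. None is returned unchanged.
    """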
    if chunk_size is not None:
        assert chunk_size > 0, "Chunk size must be greater than 0"

        if chunk_size < 1:
            chunk_size = math.ceil(chunk_size * total_size)

        chunk_size = int(chunk_size)

    return chunk_size
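

# Illustrative sketch (not part of the featuretools API): how a chunk size given
# as a fraction or as an absolute count resolves to a row count. The name
# `_sketch_resolve_chunk_size` is hypothetical; it only mirrors the logic of
# `_handle_chunk_size` above so that the example can run on its own.
import math


def _sketch_resolve_chunk_size(chunk_size, total_size):
    if chunk_size is None:
        return None
    assert chunk_size > 0, "Chunk size must be greater than 0"
    if chunk_size < 1:
        # A value in (0, 1) is read as a fraction of the total number of rows.
        chunk_size = math.ceil(chunk_size * total_size)
    return int(chunk_size)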


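def update_progress_callback_parameters(progress_bar, previous_progress):
    """Return (update, progress_percent, time_elapsed) for a progress callback.

    update is the percentage gained since previous_progress, progress_percent
    is the overall percentage complete, and time_elapsed is read from the
    progress bar's format_dict.
    """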
    update = (progress_bar.n - previous_progress) / progress_bar.total * 100
    progress_percent = (progress_bar.n / progress_bar.total) * 100
    time_elapsed = progress_bar.format_dict["elapsed"]
    return (update, progress_percent, time_elapsed)
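

# Illustrative sketch: the arithmetic behind the progress callback arguments,
# using a stand-in object instead of a real progress bar. The names
# `_sketch_progress_parameters` and `_sketch_progress_demo` are hypothetical;
# only the attributes read by `update_progress_callback_parameters`
# (`n`, `total`, `format_dict`) are modelled.
from types import SimpleNamespace


def _sketch_progress_parameters(progress_bar, previous_progress):
    # Percentage points gained since the previous update.
    update = (progress_bar.n - previous_progress) / progress_bar.total * 100
    # Overall percentage complete.
    progress_percent = progress_bar.n / progress_bar.total * 100
    return update, progress_percent, progress_bar.format_dict["elapsed"]


def _sketch_progress_demo():
    # 30 of 120 rows done, 20 of which were already reported earlier.
    bar = SimpleNamespace(n=30, total=120, format_dict={"elapsed": 1.5})
    return _sketch_progress_parameters(bar, previous_progress=20)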


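def init_ww_and_concat_fm(feature_matrix, ww_init_kwargs):
    """Initialize Woodwork on each feature matrix chunk and on their concatenation.

    Age, Boolean and Integer columns that contain nulls in any chunk are switched
    to their nullable logical types. Note that ww_init_kwargs["logical_types"]
    is updated in place.
    """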
    cols_to_check = {
        col
        for col, ltype in ww_init_kwargs["logical_types"].items()
        if isinstance(ltype, (Age, Boolean, Integer))
    }
    replacement_type = {
        "age": AgeNullable(),
        "boolean": BooleanNullable(),
        "integer": IntegerNullable(),
    }
    for fm in feature_matrix:
        updated_cols = set()
        for col in cols_to_check:
            # Only convert types if null values are present
            if fm[col].isnull().any():
                current_type = ww_init_kwargs["logical_types"][col].type_string
                ww_init_kwargs["logical_types"][col] = replacement_type[current_type]
                updated_cols.add(col)
        cols_to_check = cols_to_check - updated_cols
        fm.ww.init(**ww_init_kwargs)

    feature_matrix = pd.concat(feature_matrix)

    feature_matrix.ww.init(**ww_init_kwargs)
    return feature_matrix


================================================
FILE: featuretools/computational_backends/feature_set.py
================================================
import itertools
import logging
from collections import defaultdict

from featuretools.entityset.relationship import RelationshipPath
from featuretools.feature_base import (
    AggregationFeature,
    FeatureOutputSlice,
    GroupByTransformFeature,
    TransformFeature,
)
from featuretools.utils import Trie

logger = logging.getLogger("featuretools.computational_backend")


class FeatureSet(object):
    """
    Represents an immutable set of features to be calculated for a single dataframe, and their
    dependencies.
    """

    def __init__(self, features, approximate_feature_trie=None):
        """
        Args:
            features (list[Feature]): Features of the target dataframe.
            approximate_feature_trie (Trie[RelationshipPath, set[str]], optional): Dependency
                features to ignore because they have already been approximated. For example, if
                one of the target features is a direct feature of a feature A and A is included in
                approximate_feature_trie then neither A nor its dependencies will appear in
                FeatureSet.feature_trie.
        """
        self.target_df_name = features[0].dataframe_name
        self.target_features = features
        self.target_feature_names = {f.unique_name() for f in features}

        if not approximate_feature_trie:
            approximate_feature_trie = Trie(
                default=list,
                path_constructor=RelationshipPath,
            )
        self.approximate_feature_trie = approximate_feature_trie

        # Maps the unique name of each feature to the actual feature. This is necessary
        # because features do not support equality and so cannot be used as
        # dictionary keys. The equality operator on features produces a new
        # feature (which will always be truthy).
        self.features_by_name = {f.unique_name(): f for f in features}

        feature_dependents = defaultdict(set)
        for f in features:
            deps = f.get_dependencies(deep=True)
            for dep in deps:
                feature_dependents[dep.unique_name()].add(f.unique_name())
                self.features_by_name[dep.unique_name()] = dep
                subdeps = dep.get_dependencies(deep=True)
                for sd in subdeps:
                    feature_dependents[sd.unique_name()].add(dep.unique_name())

        # Maps feature names (keys) to the features that rely on them (values).
        self.feature_dependents = {
            fname: [self.features_by_name[dname] for dname in feature_dependents[fname]]
            for fname in self.features_by_name
        }

        self._feature_trie = None

    @property
    def feature_trie(self):
        """
        The target features and their dependencies organized into a trie by relationship path.
        This is built once when it is first called (to avoid building it if it is not needed) and
        then used for all subsequent calls.

        The edges of the trie are RelationshipPaths and the values are tuples of
        (bool, set[str], set[str]). The bool represents whether the full dataframe is needed at
        that node, the first set contains the names of features which are needed on the full
        dataframe, and the second set contains the names of the rest of the features.

        Returns:
            Trie[RelationshipPath, (bool, set[str], set[str])]
        """
        if not self._feature_trie:
            self._feature_trie = self._build_feature_trie()

        return self._feature_trie

    def _build_feature_trie(self):
        """
        Build the feature trie by adding the target features and their dependencies recursively.
        """
        feature_trie = Trie(
            default=lambda: (False, set(), set()),
            path_constructor=RelationshipPath,
        )

        for f in self.target_features:
            self._add_feature_to_trie(feature_trie, f, self.approximate_feature_trie)

        return feature_trie

    def _add_feature_to_trie(
        self,
        trie,
        feature,
        approximate_feature_trie,
        ancestor_needs_full_dataframe=False,
    ):
        """
        Add the given feature to the root of the trie, and recurse on its dependencies. If it is in
        approximate_feature_trie then it will not be added and we will not recurse on its dependencies.
        """
        node_needs_full_dataframe, full_features, not_full_features = trie.value
        needs_full_dataframe = (
            ancestor_needs_full_dataframe or self.uses_full_dataframe(feature)
        )

        name = feature.unique_name()

        # If this feature is ignored then don't add it or any of its dependencies.
        if name in approximate_feature_trie.value:
            return

        # Add the feature to one of the sets, depending on whether it needs the full dataframe.
        if needs_full_dataframe:
            full_features.add(name)
            if name in not_full_features:
                not_full_features.remove(name)

            # Update needs_full_dataframe for this node.
            trie.value = (True, full_features, not_full_features)

            # Set every node in relationship path to needs_full_dataframe.
            sub_trie = trie
            for edge in feature.relationship_path:
                sub_trie = sub_trie.get_node([edge])
                (_, f1, f2) = sub_trie.value
                sub_trie.value = (True, f1, f2)
        else:
            if name not in full_features:
                not_full_features.add(name)

            sub_trie = trie.get_node(feature.relationship_path)

        sub_ignored_trie = approximate_feature_trie.get_node(feature.relationship_path)

        for dep_feat in feature.get_dependencies():
            if isinstance(dep_feat, FeatureOutputSlice):
                dep_feat = dep_feat.base_feature
            self._add_feature_to_trie(
                sub_trie,
                dep_feat,
                sub_ignored_trie,
                ancestor_needs_full_dataframe=needs_full_dataframe,
            )

    def group_features(self, feature_names):
        """
        Topologically sort the given features, then group by depth, path,
        feature type, use_previous, where, whether the full dataframe is
        needed, and groupby.
        """
        features = [self.features_by_name[name] for name in feature_names]
        depths = self._get_feature_depths(features)

        def key_func(f):
            return (
                depths[f.unique_name()],
                f.relationship_path_name(),
                str(f.__class__),
                _get_use_previous(f),
                _get_where(f),
                self.uses_full_dataframe(f),
                _get_groupby(f),
            )

        # Sort the list of features by the complex key function above, then
        # group them by the same key
        sort_feats = sorted(features, key=key_func)
        feature_groups = [
            list(g) for _, g in itertools.groupby(sort_feats, key=key_func)
        ]

        return feature_groups

    def _get_feature_depths(self, features):
        """
        Generate and return a mapping of {feature name -> depth} in the
        feature DAG for the given dataframe.
        """
        order = defaultdict(int)
        depths = {}
        queue = features[:]
        while queue:
            # Get the next feature.
            f = queue.pop(0)

            depths[f.unique_name()] = order[f.unique_name()]

            # Only look at dependencies if they are on the same dataframe.
            if not f.relationship_path:
                dependencies = f.get_dependencies()
                for dep in dependencies:
                    order[dep.unique_name()] = min(
                        order[f.unique_name()] - 1,
                        order[dep.unique_name()],
                    )
                    queue.append(dep)

        return depths

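    def uses_full_dataframe(self, feature, check_dependents=False):
        """
        Return True if the feature is a transform whose primitive needs the full
        dataframe. If check_dependents is True, also return True when any feature
        that depends on it does.
        """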
        if (
            isinstance(feature, TransformFeature)
            and feature.primitive.uses_full_dataframe
        ):
            return True
        return check_dependents and self._dependent_uses_full_dataframe(feature)

    def _dependent_uses_full_dataframe(self, feature):
        for d in self.feature_dependents[feature.unique_name()]:
            if isinstance(d, TransformFeature) and d.primitive.uses_full_dataframe:
                return True
        return False


# These functions are used for sorting and grouping features


def _get_use_previous(f):
    # TODO: Sort and group features for DateOffset with two different temporal values
    if isinstance(f, AggregationFeature) and f.use_previous is not None:
        if len(f.use_previous.times.keys()) > 1:
            return ("", -1)
        else:
            unit = list(f.use_previous.times.keys())[0]
            value = f.use_previous.times[unit]
            return (unit, value)
    else:
        return ("", -1)


def _get_where(f):
    if isinstance(f, AggregationFeature) and f.where is not None:
        return f.where.unique_name()
    else:
        return ""


def _get_groupby(f):
    if isinstance(f, GroupByTransformFeature):
        return f.groupby.unique_name()
    else:
        return ""
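

# Illustrative sketch of the sort-then-group pattern used by
# `FeatureSet.group_features`: `itertools.groupby` only merges adjacent items,
# so the input is first sorted by the same key. The tuples handled here are
# hypothetical stand-ins for features, reduced to (name, depth, path, kind);
# `_sketch_group_by_key` is not part of the featuretools API.
import itertools


def _sketch_group_by_key(items):
    def key_func(item):
        _, depth, path, kind = item
        return (depth, path, kind)

    # sorted() is stable, so items sharing a key keep their input order.
    ordered = sorted(items, key=key_func)
    return [
        [name for name, *_ in group]
        for _, group in itertools.groupby(ordered, key=key_func)
    ]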


================================================
FILE: featuretools/computational_backends/feature_set_calculator.py
================================================
from datetime import datetime
from functools import partial

import numpy as np
import pandas as pd
import pandas.api.types as pdtypes

from featuretools.entityset.relationship import RelationshipPath
from featuretools.exceptions import UnknownFeature
from featuretools.feature_base import (
    AggregationFeature,
    DirectFeature,
    GroupByTransformFeature,
    IdentityFeature,
    TransformFeature,
)
from featuretools.utils import Trie
from featuretools.utils.gen_utils import get_relationship_column_id


class FeatureSetCalculator(object):
    """
    Calculates the values of a set of features for given instance ids.
    """

    def __init__(
        self,
        entityset,
        feature_set,
        time_last=None,
        training_window=None,
        precalculated_features=None,
    ):
        """
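        Args:
            entityset (EntitySet): An already initialized entityset.

            feature_set (FeatureSet): The features to calculate values for.

            time_last (pd.Timestamp, optional): Last allowed time. Defaults to the
                current time. Whether data from exactly this time is included is
                determined by include_cutoff_time in run.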

            training_window (Timedelta, optional): Window defining how much time before the cutoff time data
                can be used when calculating features. If None, all data before cutoff time is used.

            precalculated_features (Trie[RelationshipPath -> pd.DataFrame]):
                Maps RelationshipPaths to dataframes of precalculated_features

        """
        self.entityset = entityset
        self.feature_set = feature_set
        self.training_window = training_window

        if time_last is None:
            time_last = datetime.now()

        self.time_last = time_last

        if precalculated_features is None:
            precalculated_features = Trie(path_constructor=RelationshipPath)

        self.precalculated_features = precalculated_features

        # total number of features (including dependencies) to be calculated
        self.num_features = sum(
            len(features1) + len(features2)
            for _, (_, features1, features2) in self.feature_set.feature_trie
        )

    def run(self, instance_ids, progress_callback=None, include_cutoff_time=True):
        """
        Calculate values of features for the given instances of the target
        dataframe.

        Summary of algorithm:
        1. Construct a trie where the edges are relationships and each node
            contains a set of features for a single dataframe. See
            FeatureSet._build_feature_trie.
        2. Initialize a trie for storing dataframes.
        3. Traverse the trie using depth first search. At each node calculate
            the features and store the resulting dataframe in the dataframe
            trie (so that its values can be used by features which depend on
            these features). See _calculate_features_for_dataframe.
        4. Get the dataframe at the root of the trie (for the target dataframe) and
            return the columns corresponding to the requested features.

        Args:
            instance_ids (np.ndarray or pd.Categorical): Instance ids for which
                to build features.

            progress_callback (callable): function to be called with incremental progress updates

            include_cutoff_time (bool): If True, data at cutoff time are included
                in calculating features.

        Returns:
            pd.DataFrame : Pandas DataFrame of calculated feature values.
                Indexed by instance_ids. Columns in same order as features
                passed in.
        """
        assert len(instance_ids) > 0, "0 instance ids provided"

        if progress_callback is None:
            # do nothing for the progress callback if not provided
            def progress_callback(*args):
                pass

        feature_trie = self.feature_set.feature_trie

        df_trie = Trie(path_constructor=RelationshipPath)
        full_dataframe_trie = Trie(path_constructor=RelationshipPath)

        target_dataframe = self.entityset[self.feature_set.target_df_name]

        self._calculate_features_for_dataframe(
            dataframe_name=self.feature_set.target_df_name,
            feature_trie=feature_trie,
            df_trie=df_trie,
            full_dataframe_trie=full_dataframe_trie,
            precalculated_trie=self.precalculated_features,
            filter_column=target_dataframe.ww.index,
            filter_values=instance_ids,
            progress_callback=progress_callback,
            include_cutoff_time=include_cutoff_time,
        )

        # The dataframe for the target dataframe should be stored at the root of
        # df_trie.
        df = df_trie.value

        # Fill in empty rows with default values.
        index_dtype = df.index.dtype.name
        if df.empty:
            return self.generate_default_df(instance_ids=instance_ids)

        missing_ids = [
            i for i in instance_ids if i not in df[target_dataframe.ww.index]
        ]
        if missing_ids:
            default_df = self.generate_default_df(
                instance_ids=missing_ids,
                extra_columns=df.columns,
            )

            df = pd.concat([df, default_df], sort=True)

        df.index.name = self.entityset[self.feature_set.target_df_name].ww.index

        # Order by instance_ids
        unique_instance_ids = pd.unique(instance_ids)
        unique_instance_ids = unique_instance_ids.astype(instance_ids.dtype)
        df = df.reindex(unique_instance_ids)

        # Keep categorical index if original index was categorical
        if index_dtype == "category":
            df.index = df.index.astype("category")

        column_list = []

        for feat in self.feature_set.target_features:
            column_list.extend(feat.get_feature_names())

        return df[column_list]

    def _calculate_features_for_dataframe(
        self,
        dataframe_name,
        feature_trie,
        df_trie,
        full_dataframe_trie,
        precalculated_trie,
        filter_column,
        filter_values,
        parent_data=None,
        progress_callback=None,
        include_cutoff_time=True,
    ):
        """
        Generate dataframes with features calculated for this node of the trie,
        and all descendant nodes. The dataframes will be stored in df_trie.

        Args:
            dataframe_name (str): The name of the dataframe to calculate features for.

            feature_trie (Trie): the trie with sets of features to calculate.
                The root contains features for the given dataframe.

            df_trie (Trie): a parallel trie for storing dataframes. The
                dataframe with features calculated will be placed in the root.

            full_dataframe_trie (Trie): a trie storing dataframes with all dataframe
                rows, for features whose primitives use the full dataframe.

            precalculated_trie (Trie): a parallel trie containing dataframes
                with precalculated features. The dataframe specified by dataframe_name
                will be at the root.

            filter_column (str): The name of the column to filter this
                dataframe by.

            filter_values (pd.Series): The values to filter the filter_column
                to.

            parent_data (tuple[Relationship, list[str], pd.DataFrame]): Data
                related to the parent of this trie. This will only be present if
                the relationship points from this dataframe to the parent dataframe. A
                3 tuple of (parent_relationship,
                ancestor_relationship_columns, parent_df).
                ancestor_relationship_columns lists the names of the columns which
                link the parent dataframe to its ancestors.

            progress_callback (callable): function to be called with incremental
                progress updates.

            include_cutoff_time (bool): If True, data at cutoff time are included
                in calculating features.

        """
        # Step 1: Get a dataframe for the given dataframe name, filtered by the given
        # conditions.

        (
            need_full_dataframe,
            full_dataframe_features,
            not_full_dataframe_features,
        ) = feature_trie.value

        all_features = full_dataframe_features | not_full_dataframe_features
        columns = self._necessary_columns(dataframe_name, all_features)

        # If we need the full dataframe then don't filter by filter_values.
        if need_full_dataframe:
            query_column = None
            query_values = None
        else:
            query_column = filter_column
            query_values = filter_values

        df = self.entityset.query_by_values(
            dataframe_name=dataframe_name,
            instance_vals=query_values,
            column_name=query_column,
            columns=columns,
            time_last=self.time_last,
            training_window=self.training_window,
            include_cutoff_time=include_cutoff_time,
        )

        # call to update timer
        progress_callback(0)

        # Step 2: Add columns to the dataframe linking it to all ancestors.
        new_ancestor_relationship_columns = []
        if parent_data:
            parent_relationship, ancestor_relationship_columns, parent_df = parent_data

            if ancestor_relationship_columns:
                (
                    df,
                    new_ancestor_relationship_columns,
                ) = self._add_ancestor_relationship_columns(
                    df,
                    parent_df,
                    ancestor_relationship_columns,
                    parent_relationship,
                )

            # Add the column linking this dataframe to its parent, so that
            # descendants get linked to the parent.
            new_ancestor_relationship_columns.append(
                parent_relationship._child_column_name,
            )

        # call to update timer
        progress_callback(0)

        # Step 3: Recurse on children.

        # Pass filtered values, even if we are using a full df.
        if need_full_dataframe:
            filtered_df = df[df[filter_column].isin(filter_values)]
        else:
            filtered_df = df

        for edge, sub_trie in feature_trie.children():
            is_forward, relationship = edge
            if is_forward:
                sub_dataframe_name = relationship.parent_dataframe.ww.name
                sub_filter_column = relationship._parent_column_name
                sub_filter_values = filtered_df[relationship._child_column_name]
                parent_data = None
            else:
                sub_dataframe_name = relationship.child_dataframe.ww.name
                sub_filter_column = relationship._child_column_name
                sub_filter_values = filtered_df[relationship._parent_column_name]

                parent_data = (relationship, new_ancestor_relationship_columns, df)

            sub_df_trie = df_trie.get_node([edge])
            sub_full_dataframe_trie = full_dataframe_trie.get_node([edge])
            sub_precalc_trie = precalculated_trie.get_node([edge])
            self._calculate_features_for_dataframe(
                dataframe_name=sub_dataframe_name,
                feature_trie=sub_trie,
                df_trie=sub_df_trie,
                full_dataframe_trie=sub_full_dataframe_trie,
                precalculated_trie=sub_precalc_trie,
                filter_column=sub_filter_column,
                filter_values=sub_filter_values,
                parent_data=parent_data,
                progress_callback=progress_callback,
                include_cutoff_time=include_cutoff_time,
            )

        # Step 4: Calculate the features for this dataframe.
        #
        # All dependencies of the features for this dataframe have been calculated
        # by the above recursive calls, and their results stored in df_trie.

        # Add any precalculated features.
        precalculated_features_df = precalculated_trie.value
        if precalculated_features_df is not None:
            # Left outer merge to keep all rows of df.
            df = df.merge(
                precalculated_features_df,
                how="left",
                left_index=True,
                right_index=True,
                suffixes=("", "_precalculated"),
            )

        # call to update timer
        progress_callback(0)

        # First, calculate any features that require the full dataframe. These can
        # be calculated first because all of their dependencies are included in
        # full_dataframe_features.
        if need_full_dataframe:
            df = self._calculate_features(
                df,
                full_dataframe_trie,
                full_dataframe_features,
                progress_callback,
            )

            # Store full dataframe
            full_dataframe_trie.value = df

            # Filter df so that features that don't require the full dataframe are
            # only calculated on the necessary instances.
            df = df[df[filter_column].isin(filter_values)]

        # Calculate all features that don't require the full dataframe.
        df = self._calculate_features(
            df,
            df_trie,
            not_full_dataframe_features,
            progress_callback,
        )

        # Step 5: Store the dataframe for this dataframe at the root of df_trie, so
        # that it can be accessed by the caller.
        df_trie.value = df

    def _calculate_features(self, df, df_trie, features, progress_callback):
        # Group the features so that each group can be calculated together.
        # The groups must also be in topological order (if A is a transform of B
        # then B must be in a group before A).
        feature_groups = self.feature_set.group_features(features)

        for group in feature_groups:
            representative_feature = group[0]
            handler = self._feature_type_handler(representative_feature)
            df = handler(group, df, df_trie, progress_callback)

        return df

    def _add_ancestor_relationship_columns(
        self,
        child_df,
        parent_df,
        ancestor_relationship_columns,
        relationship,
    ):
        """
        Merge ancestor_relationship_columns from parent_df into child_df, adding a prefix to
        each column name specifying the relationship.

        Return the updated df and the new relationship column names.

        Args:
            child_df (pd.DataFrame): The dataframe to add relationship columns to.
            parent_df (pd.DataFrame): The dataframe to copy relationship columns from.
            ancestor_relationship_columns (list[str]): The names of
                relationship columns in the parent_df to copy into child_df.
            relationship (Relationship): the relationship through which the
                child is connected to the parent.
        """
        relationship_name = relationship.parent_name
        new_relationship_columns = [
            "%s.%s" % (relationship_name, col) for col in ancestor_relationship_columns
        ]

        # create an intermediate dataframe which shares a column
        # with the child dataframe and has a column with the
        # original parent's id.
        col_map = {relationship._parent_column_name: relationship._child_column_name}
        for child_column, parent_column in zip(
            new_relationship_columns,
            ancestor_relationship_columns,
        ):
            col_map[parent_column] = child_column

        merge_df = parent_df[list(col_map.keys())].rename(columns=col_map)

        merge_df.index.name = None  # change index name for merge

        # Merge the dataframe, adding the relationship columns to the child.
        # Left outer join so that all rows in child are kept (if it contains
        # all rows of the dataframe then there may not be corresponding rows in the
        # parent_df).
        df = child_df.merge(
            merge_df,
            how="left",
            left_on=relationship._child_column_name,
            right_on=relationship._child_column_name,
        )

        # ensure index is maintained
        df.set_index(
            relationship.child_dataframe.ww.index,
            drop=False,
            inplace=True,
        )

        return df, new_relationship_columns

    def generate_default_df(self, instance_ids, extra_columns=None):
        default_row = []
        default_cols = []
        for f in self.feature_set.target_features:
            for name in f.get_feature_names():
                default_cols.append(name)
                default_row.append(f.default_value)

        default_matrix = [default_row] * len(instance_ids)
        default_df = pd.DataFrame(
            default_matrix,
            columns=default_cols,
            index=instance_ids,
            dtype="object",
        )
        index_name = self.entityset[self.feature_set.target_df_name].ww.index
        default_df.index.name = index_name
        if extra_columns is not None:
            for c in extra_columns:
                if c not in default_df.columns:
                    default_df[c] = [np.nan] * len(instance_ids)
        return default_df

    def _feature_type_handler(self, f):
        if type(f) is TransformFeature:
            return self._calculate_transform_features
        elif type(f) is GroupByTransformFeature:
            return self._calculate_groupby_features
        elif type(f) is DirectFeature:
            return self._calculate_direct_features
        elif type(f) is AggregationFeature:
            return self._calculate_agg_features
        elif type(f) is IdentityFeature:
            return self._calculate_identity_features
        else:
            raise UnknownFeature("{} feature unknown".format(f.__class__))

    def _calculate_identity_features(self, features, df, _df_trie, progress_callback):
        for f in features:
            assert f.get_name() in df.columns, (
                'Column "%s" missing from dataframe' % f.get_name()
            )

        progress_callback(len(features) / float(self.num_features))

        return df

    def _calculate_transform_features(
        self,
        features,
        frame,
        _df_trie,
        progress_callback,
    ):
        frame_empty = frame.empty
        feature_values = []
        for f in features:
            # handle when no data
            if frame_empty:
                # Even though we are adding the default values here, when these new
                # features are added to the dataframe in update_feature_columns, they
                # are added as empty columns since the dataframe itself is empty.
                feature_values.append(
                    (f, [f.default_value for _ in range(f.number_output_features)]),
                )
                progress_callback(1 / float(self.num_features))
                continue

            # collect only the columns we need for this transformation

            column_data = [frame[bf.get_name()] for bf in f.base_features]

            feature_func = f.get_function()
            # apply the function to the relevant dataframe slice and add the
            # feature row to the results dataframe.
            if f.primitive.uses_calc_time:
                values = feature_func(*column_data, time=self.time_last)
            else:
                values = feature_func(*column_data)

            # if we don't get just the values, the assignment breaks when indexes don't match
            if f.number_output_features > 1:
                values = [strip_values_if_series(value) for value in values]
            else:
                values = [strip_values_if_series(values)]

            feature_values.append((f, values))

            progress_callback(1 / float(self.num_features))

        frame = update_feature_columns(feature_values, frame)
        return frame

    def _calculate_groupby_features(self, features, frame, _df_trie, progress_callback):
        # set default values to handle the null group
        default_values = {}
        for f in features:
            for name in f.get_feature_names():
                default_values[name] = f.default_value

        frame = pd.concat(
            [frame, pd.DataFrame(default_values, index=frame.index)],
            axis=1,
        )

        # handle when no data
        if frame.shape[0] == 0:
            progress_callback(len(features) / float(self.num_features))

            return frame

        groupby = features[0].groupby.get_name()
        grouped = frame.groupby(groupby)
        # get all the unique group names to iterate over later
        groups = frame[groupby].unique()

        for f in features:
            feature_vals = []
            for _ in range(f.number_output_features):
                feature_vals.append([])

            for group in groups:
                # skip null key if it exists
                if pd.isnull(group):
                    continue

                column_names = [bf.get_name() for bf in f.base_features]
                # exclude the groupby column from being passed to the function
                column_data = [
                    grouped[name].get_group(group) for name in column_names[:-1]
                ]
                feature_func = f.get_function()

                # apply the function to the relevant dataframe slice and add the
                # feature row to the results dataframe.
                if f.primitive.uses_calc_time:
                    values = feature_func(*column_data, time=self.time_last)
                else:
                    values = feature_func(*column_data)

                if f.number_output_features == 1:
                    values = [values]

                # make sure index is aligned
                for i, value in enumerate(values):
                    if isinstance(value, pd.Series):
                        value.index = column_data[0].index
                    else:
                        value = pd.Series(value, index=column_data[0].index)
                    feature_vals[i].append(value)

            if any(feature_vals):
                assert len(feature_vals) == len(f.get_feature_names())
                for col_vals, name in zip(feature_vals, f.get_feature_names()):
                    frame[name].update(pd.concat(col_vals))

            progress_callback(1 / float(self.num_features))

        return frame

    def _calculate_direct_features(
        self,
        features,
        child_df,
        df_trie,
        progress_callback,
    ):
        path = features[0].relationship_path
        assert len(path) == 1, "Error calculating DirectFeatures, len(path) != 1"

        parent_df = df_trie.get_node([path[0]]).value
        _is_forward, relationship = path[0]
        merge_col = relationship._child_column_name

        # generate a mapping of old column names (in the parent dataframe) to
        # new column names (in the child dataframe) for the merge
        col_map = {relationship._parent_column_name: merge_col}
        index_as_feature = None

        fillna_dict = {}
        for f in features:
            feature_defaults = {
                name: f.default_value
                for name in f.get_feature_names()
                if not pd.isna(f.default_value)
            }
            fillna_dict.update(feature_defaults)
            if f.base_features[0].get_name() == relationship._parent_column_name:
                index_as_feature = f
            base_names = f.base_features[0].get_feature_names()
            for name, base_name in zip(f.get_feature_names(), base_names):
                if name in child_df.columns:
                    continue
                col_map[base_name] = name

        # merge the identity feature from the parent dataframe into the child
        merge_df = parent_df[list(col_map.keys())].rename(columns=col_map)

        if index_as_feature is not None:
            merge_df.set_index(
                index_as_feature.get_name(),
                inplace=True,
                drop=False,
            )
        else:
            merge_df.set_index(merge_col, inplace=True)

        new_df = child_df.merge(
            merge_df,
            left_on=merge_col,
            right_index=True,
            how="left",
        )

        progress_callback(len(features) / float(self.num_features))

        return new_df.fillna(fillna_dict)

    def _calculate_agg_features(self, features, frame, df_trie, progress_callback):
        test_feature = features[0]
        child_dataframe = test_feature.base_features[0].dataframe
        base_frame = df_trie.get_node(test_feature.relationship_path).value
        # Sometimes approximate features get computed in a previous filter frame
        # and put in the current one dynamically,
        # so there may be existing features here
        fl = []
        for f in features:
            for ind in f.get_feature_names():
                if ind not in frame.columns:
                    fl.append(f)
                    break
        features = fl
        if not len(features):
            progress_callback(len(features) / float(self.num_features))
            return frame

        # handle where
        base_frame_empty = base_frame.empty
        where = test_feature.where
        if where is not None and not base_frame_empty:
            base_frame = base_frame.loc[base_frame[where.get_name()]]

        # when no child data, just add all the features to frame with nan
        base_frame_empty = base_frame.empty
        if base_frame_empty:
            feature_values = []
            for f in features:
                feature_values.append((f, np.full(f.number_output_features, np.nan)))
                progress_callback(1 / float(self.num_features))
            frame = update_feature_columns(feature_values, frame)
        else:
            relationship_path = test_feature.relationship_path

            groupby_col = get_relationship_column_id(relationship_path)

            # if the use_previous property exists on this feature, include only the
            # instances from the child dataframe included in that Timedelta
            use_previous = test_feature.use_previous
            if use_previous:
                # Filter by use_previous values
                time_last = self.time_last
                if use_previous.has_no_observations():
                    time_first = time_last - use_previous
                    ti = child_dataframe.ww.time_index
                    if ti is not None:
                        base_frame = base_frame[base_frame[ti] >= time_first]
                else:
                    n = use_previous.get_value("o")

                    def last_n(df):
                        return df.iloc[-n:]

                    base_frame = base_frame.groupby(
                        groupby_col,
                        observed=True,
                        sort=False,
                        group_keys=False,
                    ).apply(last_n)

            to_agg = {}
            agg_rename = {}
            to_apply = set()
            # apply multi-column and time-dependent features as we find them, and
            # save aggregable features for later
            for f in features:
                if _can_agg(f):
                    column_id = f.base_features[0].get_name()
                    if column_id not in to_agg:
                        to_agg[column_id] = []
                    func = f.get_function()

                    # for some reason, using the string count is significantly
                    # faster than any method a primitive can return
                    # https://stackoverflow.com/questions/55731149/use-a-function-instead-of-string-in-pandas-groupby-agg
                    if func == pd.Series.count:
                        func = "count"

                    funcname = func
                    if callable(func):
                        # if the same function is being applied to the same
                        # column twice, wrap it in a partial to avoid
                        # duplicate functions
                        funcname = str(id(func))
                        if "{}-{}".format(column_id, funcname) in agg_rename:
                            func = partial(func)
                            funcname = str(id(func))

                        func.__name__ = funcname

                    to_agg[column_id].append(func)
                    # this is used below to rename columns that pandas names for us
                    agg_rename["{}-{}".format(column_id, funcname)] = f.get_name()
                    continue

                to_apply.add(f)

            # Apply the non-aggregable functions to generate a new dataframe, and
            # merge it with the existing one
            if len(to_apply):
                wrap = agg_wrapper(to_apply, self.time_last)
                # groupby_col can be both the name of the index and a column,
                # to silence pandas warning about ambiguity we explicitly pass
                # the column (in actuality grouping by both index and group would
                # work)
                to_merge = base_frame.groupby(
                    base_frame[groupby_col],
                    observed=True,
                    sort=False,
                    group_keys=False,
                ).apply(wrap)
                frame = pd.merge(
                    left=frame,
                    right=to_merge,
                    left_index=True,
                    right_index=True,
                    how="left",
                )

                progress_callback(len(to_apply) / float(self.num_features))

            # Apply the aggregate functions to generate a new dataframe, and merge
            # it with the existing one
            if len(to_agg):
                # groupby_col can be both the name of the index and a column,
                # to silence pandas warning about ambiguity we explicitly pass
                # the column (in actuality grouping by both index and group would
                # work)
                to_merge = base_frame.groupby(
                    base_frame[groupby_col],
                    observed=True,
                    sort=False,
                ).agg(to_agg)
                # rename columns to the correct feature names
                to_merge.columns = [agg_rename["-".join(x)] for x in to_merge.columns]
                to_merge = to_merge[list(agg_rename.values())]

                # Workaround for pandas bug where categories are in the wrong order
                # see: https://github.com/pandas-dev/pandas/issues/22501
                #
                # Pandas claims that bug is fixed but it still shows up in some
                # cases.  More investigation needed.
                if isinstance(frame.index.dtype, pd.CategoricalDtype):
                    categories = pdtypes.CategoricalDtype(
                        categories=frame.index.categories,
                    )
                    to_merge.index = to_merge.index.astype(object).astype(categories)

                frame = pd.merge(
                    left=frame,
                    right=to_merge,
                    left_index=True,
                    right_index=True,
                    how="left",
                )

                # determine number of features that were just merged
                progress_callback(len(to_merge.columns) / float(self.num_features))

        # Handle default values
        fillna_dict = {}
        for f in features:
            feature_defaults = {name: f.default_value for name in f.get_feature_names()}
            fillna_dict.update(feature_defaults)

        frame = frame.fillna(fillna_dict)

        return frame

    def _necessary_columns(self, dataframe_name, feature_names):
        # Always keep the index, time index, and foreign key columns: we don't know
        # which forward relationships will come from this node.
        df = self.entityset[dataframe_name]
        index_columns = {
            col
            for col in df.columns
            if {"index", "foreign_key", "time_index"} & df.ww.semantic_tags[col]
        }
        features = (self.feature_set.features_by_name[name] for name in feature_names)

        feature_columns = {
            f.column_name for f in features if isinstance(f, IdentityFeature)
        }
        return list(index_columns | feature_columns)


def _can_agg(feature):
    assert isinstance(feature, AggregationFeature)
    base_features = feature.base_features
    if feature.where is not None:
        base_features = [
            bf.get_name()
            for bf in base_features
            if bf.get_name() != feature.where.get_name()
        ]

    if feature.primitive.uses_calc_time:
        return False
    single_output = feature.primitive.number_output_features == 1
    return len(base_features) == 1 and single_output


def agg_wrapper(feats, time_last):
    def wrap(df):
        d = {}
        feature_values = []
        for f in feats:
            func = f.get_function()
            column_ids = [bf.get_name() for bf in f.base_features]
            args = [df[v] for v in column_ids]

            if f.primitive.uses_calc_time:
                values = func(*args, time=time_last)
            else:
                values = func(*args)

            if f.number_output_features == 1:
                values = [values]
            feature_values.append((f, values))

        d = update_feature_columns(feature_values, d)

        return pd.Series(d)

    return wrap


def update_feature_columns(feature_data, data):
    new_cols = {}
    for item in feature_data:
        names = item[0].get_feature_names()
        values = item[1]
        assert len(names) == len(values)
        for name, value in zip(names, values):
            new_cols[name] = value

    # Handle the case where a dict is being updated
    if isinstance(data, dict):
        data.update(new_cols)
        return data

    return pd.concat([data, pd.DataFrame(new_cols, index=data.index)], axis=1)


def strip_values_if_series(values):
    if isinstance(values, pd.Series):
        values = values.values
    return values


================================================
FILE: featuretools/computational_backends/utils.py
================================================
import logging
import os
import typing
import warnings
from datetime import datetime
from functools import wraps

import numpy as np
import pandas as pd
import psutil
from woodwork.logical_types import Datetime, Double

from featuretools.entityset.relationship import RelationshipPath
from featuretools.feature_base import AggregationFeature, DirectFeature
from featuretools.utils import Trie
from featuretools.utils.gen_utils import import_or_none
from featuretools.utils.wrangle import _check_time_type, _check_timedelta

logger = logging.getLogger("featuretools.computational_backend")


def bin_cutoff_times(cutoff_time, bin_size):
    binned_cutoff_time = cutoff_time.ww.copy()
    if isinstance(bin_size, int):
        binned_cutoff_time["time"] = binned_cutoff_time["time"].apply(
            # floor division rounds each time down to the start of its bin
            lambda x: x // bin_size * bin_size,
        )
    else:
        bin_size = _check_timedelta(bin_size)
        binned_cutoff_time["time"] = datetime_round(
            binned_cutoff_time["time"],
            bin_size,
        )
    return binned_cutoff_time


def save_csv_decorator(save_progress=None):
    def inner_decorator(method):
        @wraps(method)
        def wrapped(*args, **kwargs):
            if save_progress is None:
                r = method(*args, **kwargs)
            else:
                time = args[0].to_pydatetime()
                file_name = "ft_" + time.strftime("%Y_%m_%d_%I-%M-%S-%f") + ".csv"
                file_path = os.path.join(save_progress, file_name)
                temp_dir = os.path.join(save_progress, "temp")
                if not os.path.exists(temp_dir):
                    os.makedirs(temp_dir)
                temp_file_path = os.path.join(temp_dir, file_name)
                r = method(*args, **kwargs)
                r.to_csv(temp_file_path)
                os.rename(temp_file_path, file_path)
            return r

        return wrapped

    return inner_decorator


def datetime_round(dt, freq):
    """
    Round a Timestamp series down to the specified frequency.
    """
    if not freq.is_absolute():
        raise ValueError("Unit is relative")

    # TODO: multitemporal units
    all_units = list(freq.times.keys())
    if len(all_units) == 1:
        unit = all_units[0]
        value = freq.times[unit]
        if unit == "m":
            # "min" is the pandas offset alias for minutes
            unit = "min"
        # No support for weeks in datetime.datetime
        if unit == "w":
            unit = "d"
            value = value * 7
        freq = str(value) + unit
        return dt.dt.floor(freq)
    else:
        raise ValueError("Frequency cannot have multiple temporal parameters")


def gather_approximate_features(feature_set):
    """
    Find features which can be approximated. Returned as a trie where the values
    are sets of feature names.

    Args:
        feature_set (FeatureSet): Features to search the dependencies of for
            features to approximate.

    Returns:
        Trie[RelationshipPath, set[str]]
    """
    approximate_feature_trie = Trie(default=set, path_constructor=RelationshipPath)

    for feature in feature_set.target_features:
        if feature_set.uses_full_dataframe(feature, check_dependents=True):
            continue

        if isinstance(feature, DirectFeature):
            path = feature.relationship_path
            base_feature = feature.base_features[0]

            while isinstance(base_feature, DirectFeature):
                path = path + base_feature.relationship_path
                base_feature = base_feature.base_features[0]

            if isinstance(base_feature, AggregationFeature):
                node_feature_set = approximate_feature_trie.get_node(path).value
                node_feature_set.add(base_feature.unique_name())

    return approximate_feature_trie


def gen_empty_approx_features_df(approx_features):
    df = pd.DataFrame(columns=[f.get_name() for f in approx_features])
    df.index.name = approx_features[0].dataframe.ww.index
    return df


def n_jobs_to_workers(n_jobs):
    try:
        cpus = len(psutil.Process().cpu_affinity())
    except AttributeError:
        cpus = psutil.cpu_count()

    # Taken from sklearn parallel_backends code
    # https://github.com/scikit-learn/scikit-learn/blob/27bbdb570bac062c71b3bb21b0876fd78adc9f7e/sklearn/externals/joblib/_parallel_backends.py#L120
    if n_jobs < 0:
        workers = max(cpus + 1 + n_jobs, 1)
    else:
        workers = min(n_jobs, cpus)

    assert workers > 0, "Need at least one worker"
    return workers


def create_client_and_cluster(n_jobs, dask_kwargs, entityset_size):
    Client, LocalCluster = get_client_cluster()

    cluster = None
    if "cluster" in dask_kwargs:
        cluster = dask_kwargs["cluster"]
    else:
        # diagnostics_port sets the default port to launch bokeh web interface
        # if it is set to None web interface will not be launched
        diagnostics_port = None
        if "diagnostics_port" in dask_kwargs:
            diagnostics_port = dask_kwargs["diagnostics_port"]
            del dask_kwargs["diagnostics_port"]

        workers = n_jobs_to_workers(n_jobs)
        if n_jobs != -1 and workers < n_jobs:
            warning_string = "{} workers requested, but only {} workers created."
            warning_string = warning_string.format(n_jobs, workers)
            warnings.warn(warning_string)

        # Distributed default memory_limit for worker is 'auto'. It calculates worker
        # memory limit as total virtual memory divided by the number
        # of cores available to the workers (always 1 for featuretools setup).
        # This means reducing the number of workers does not increase the memory
        # limit for other workers.  Featuretools default is to calculate memory limit
        # as total virtual memory divided by number of workers. To use distributed
        # default memory limit, set dask_kwargs['memory_limit']='auto'
        if "memory_limit" in dask_kwargs:
            memory_limit = dask_kwargs["memory_limit"]
            del dask_kwargs["memory_limit"]
        else:
            total_memory = psutil.virtual_memory().total
            memory_limit = int(total_memory / float(workers))

        cluster = LocalCluster(
            n_workers=workers,
            threads_per_worker=1,
            diagnostics_port=diagnostics_port,
            memory_limit=memory_limit,
            **dask_kwargs,
        )

        # if cluster has bokeh port, notify user if unexpected port number
        if diagnostics_port is not None:
            if hasattr(cluster, "scheduler") and cluster.scheduler:
                info = cluster.scheduler.identity()
                if "bokeh" in info["services"]:
                    msg = "Dashboard started on port {}"
                    print(msg.format(info["services"]["bokeh"]))

    client = Client(cluster)

    warned_of_memory = False
    for worker in list(client.scheduler_info()["workers"].values()):
        worker_limit = worker["memory_limit"]
        if worker_limit < entityset_size:
            raise ValueError("Insufficient memory to use this many workers")
        elif worker_limit < 2 * entityset_size and not warned_of_memory:
            logger.warning(
                "Worker memory is between 1 and 2 times the memory"
                " size of the EntitySet. If errors occur that do"
                " not occur with n_jobs equals 1, this may be the "
                "cause.  See https://featuretools.alteryx.com/en/stable/guides/performance.html#parallel-feature-computation"
                " for more information.",
            )
            warned_of_memory = True

    return client, cluster


def get_client_cluster():
    """
    Separated out the imports to make it easier to mock during testing
    """
    distributed = import_or_none("distributed")
    Client = distributed.Client
    LocalCluster = distributed.LocalCluster

    return Client, LocalCluster


CutoffTimeType = typing.Union[pd.DataFrame, str, datetime]


def _validate_cutoff_time(
    cutoff_time: CutoffTimeType,
    target_dataframe,
):
    """
    Verify that the cutoff time is a single value or a pandas dataframe with the proper columns
    containing no duplicate rows
    """
    if isinstance(cutoff_time, pd.DataFrame):
        cutoff_time = cutoff_time.reset_index(drop=True)

        if "instance_id" not in cutoff_time.columns:
            if target_dataframe.ww.index not in cutoff_time.columns:
                raise AttributeError(
                    "Cutoff time DataFrame must contain a column with either the same name"
                    ' as the target dataframe index or a column named "instance_id"',
                )
            # rename to instance_id
            cutoff_time.rename(
                columns={target_dataframe.ww.index: "instance_id"},
                inplace=True,
            )

        if "time" not in cutoff_time.columns:
            if (
                target_dataframe.ww.time_index
                and target_dataframe.ww.time_index not in cutoff_time.columns
            ):
                raise AttributeError(
                    "Cutoff time DataFrame must contain a column with either the same name"
                    ' as the target dataframe time_index or a column named "time"',
                )
            # rename to time
            cutoff_time.rename(
                columns={target_dataframe.ww.time_index: "time"},
                inplace=True,
            )

        # Make sure user supplies only one valid name for instance id and time columns
        if (
            "instance_id" in cutoff_time.columns
            and target_dataframe.ww.index in cutoff_time.columns
            and "instance_id" != target_dataframe.ww.index
        ):
            raise AttributeError(
                'Cutoff time DataFrame cannot contain both a column named "instance_id" and a column'
                " with the same name as the target dataframe index",
            )
        if (
            "time" in cutoff_time.columns
            and target_dataframe.ww.time_index in cutoff_time.columns
            and "time" != target_dataframe.ww.time_index
        ):
            raise AttributeError(
                'Cutoff time DataFrame cannot contain both a column named "time" and a column'
                " with the same name as the target dataframe time index",
            )

        assert (
            cutoff_time[["instance_id", "time"]].duplicated().sum() == 0
        ), "Duplicated rows in cutoff time dataframe."
    if isinstance(cutoff_time, str):
        try:
            cutoff_time = pd.to_datetime(cutoff_time)
        except ValueError as e:
            raise ValueError(f"While parsing cutoff_time: {str(e)}") from e
        except OverflowError as e:
            raise OverflowError(f"While parsing cutoff_time: {str(e)}") from e
    else:
        if isinstance(cutoff_time, list):
            raise TypeError("cutoff_time must be a single value or DataFrame")

    return cutoff_time


def _check_cutoff_time_type(cutoff_time, es_time_type):
    """
    Check that the cutoff time values are of the proper type given the entityset time type
    """
    # Check that cutoff_time time type matches entityset time type
    if isinstance(cutoff_time, tuple):
        cutoff_time_value = cutoff_time[0]
        time_type = _check_time_type(cutoff_time_value)
        is_numeric = time_type == "numeric"
        is_datetime = time_type == Datetime
    else:
        cutoff_time_col = cutoff_time.ww["time"]
        is_numeric = cutoff_time_col.ww.schema.is_numeric
        is_datetime = cutoff_time_col.ww.schema.is_datetime

    if es_time_type == "numeric" and not is_numeric:
        raise TypeError(
            "cutoff_time times must be numeric: try casting via pd.to_numeric()",
        )
    if es_time_type == Datetime and not is_datetime:
        raise TypeError(
            "cutoff_time times must be datetime type: try casting "
            "via pd.to_datetime()",
        )


def replace_inf_values(feature_matrix, replacement_value=np.nan, columns=None):
    """Replace all ``np.inf`` and ``-np.inf`` values in a feature matrix with the specified replacement value.

    Args:
        feature_matrix (DataFrame): DataFrame whose columns are feature names and rows are instances
        replacement_value (int, float, str, optional): Value with which infinite values will be replaced.
            Defaults to ``np.nan``.
        columns (list[str], optional): A list specifying which columns should have values replaced. If None,
            values will be replaced for all columns.

    Returns:
        DataFrame: The feature matrix with infinite values replaced.

    """
    if columns is None:
        feature_matrix = feature_matrix.replace([np.inf, -np.inf], replacement_value)
    else:
        feature_matrix[columns] = feature_matrix[columns].replace(
            [np.inf, -np.inf],
            replacement_value,
        )
    return feature_matrix


def get_ww_types_from_features(
    features,
    entityset,
    pass_columns=None,
    cutoff_time=None,
):
    """Given a list of features and entityset (and optionally a list of pass
    through columns and the cutoff time dataframe), returns the logical types,
    semantic tags, and origin of each column in the feature matrix. Both
    pass_columns and cutoff_time must be supplied in order to get the
    type information for the pass through columns.
    """
    if pass_columns is None:
        pass_columns = []
    logical_types = {}
    semantic_tags = {}
    origins = {}

    for feature in features:
        names = feature.get_feature_names()
        for name in names:
            logical_types[name] = feature.column_schema.logical_type
            semantic_tags[name] = feature.column_schema.semantic_tags.copy()
            semantic_tags[name] -= {"index", "time_index"}

            if logical_types[name] is None and "numeric" in semantic_tags[name]:
                logical_types[name] = Double
            if all([f.primitive is None for f in feature.get_dependencies(deep=True)]):
                origins[name] = "base"
            else:
                origins[name] = "engineered"

    if pass_columns:
        cutoff_schema = cutoff_time.ww.schema
        for column in pass_columns:
            logical_types[column] = cutoff_schema.logical_types[column]
            semantic_tags[column] = cutoff_schema.semantic_tags[column]
            origins[column] = "base"

    ww_init = {
        "logical_types": logical_types,
        "semantic_tags": semantic_tags,
        "column_origins": origins,
    }
    return ww_init


================================================
FILE: featuretools/config_init.py
================================================
import copy
import logging
import os
import sys


def initialize_logging():
    loggers = {}

    # Check for environmental variables
    logger_env_vars = {
        "FEATURETOOLS_LOG_LEVEL": "featuretools",
        "FEATURETOOLS_ES_LOG_LEVEL": "featuretools.entityset",
        "FEATURETOOLS_BACKEND_LOG_LEVEL": "featuretools.computation_backend",
    }
    for logger_env, logger in logger_env_vars.items():
        log_level = os.environ.get(logger_env, None)
        if log_level is not None:
            loggers[logger] = log_level

    # Set log level to info if not otherwise specified.
    loggers.setdefault("featuretools", "info")
    loggers.setdefault("featuretools.computation_backend", "info")
    loggers.setdefault("featuretools.entityset", "info")

    fmt = "%(asctime)-15s %(name)s - %(levelname)s    %(message)s"
    out_handler = logging.StreamHandler(sys.stdout)
    err_handler = logging.StreamHandler(sys.stdout)
    out_handler.setFormatter(logging.Formatter(fmt))
    err_handler.setFormatter(logging.Formatter(fmt))
    err_levels = ["WARNING", "ERROR", "CRITICAL"]

    for name, level in list(loggers.items()):
        LEVEL = getattr(logging, level.upper())
        logger = logging.getLogger(name)
        logger.setLevel(LEVEL)
        # Iterate over a copy so that removing handlers does not skip entries
        for _handler in list(logger.handlers):
            logger.removeHandler(_handler)

        if level.upper() in err_levels:
            logger.addHandler(err_handler)
        else:
            logger.addHandler(out_handler)
        logger.propagate = False


initialize_logging()


class Config:
    def __init__(self):
        self._data = {}
        self.set_to_default()

    def set_to_default(self):
        PWD = os.path.dirname(__file__)
        primitive_data_folder = os.path.join(PWD, "primitives/data")
        self._data = {
            "primitive_data_folder": primitive_data_folder,
        }

    def get(self, key):
        return copy.deepcopy(self._data[key])

    def get_all(self):
        return copy.deepcopy(self._data)

    def set(self, values):
        self._data.update(values)


config = Config()


================================================
FILE: featuretools/demo/__init__.py
================================================
# flake8: noqa
from featuretools.demo.api import *


================================================
FILE: featuretools/demo/api.py
================================================
# flake8: noqa
from featuretools.demo.flight import load_flight
from featuretools.demo.mock_customer import load_mock_customer
from featuretools.demo.retail import load_retail
from featuretools.demo.weather import load_weather


================================================
FILE: featuretools/demo/flight.py
================================================
import math
import re

import pandas as pd
from tqdm import tqdm
from woodwork.logical_types import Boolean, Categorical, Ordinal

import featuretools as ft


def load_flight(
    month_filter=None,
    categorical_filter=None,
    nrows=None,
    demo=True,
    return_single_table=False,
    verbose=False,
):
    """
    Download, clean, and filter flight data from 2017.
    The original dataset can be found `here <https://www.transtats.bts.gov/ot_delay/ot_delaycause1.asp>`_.

    Args:

        month_filter (list[int]): Only use data from these months (example is ``[1, 2]``).
            To skip, set to None.
        categorical_filter (dict[str->list[str]]): Use only specified categorical values.
            Example is ``{'dest_city': ['Boston, MA'], 'origin_city': ['Boston, MA']}``
            which returns all flights in OR out of Boston. To skip, set to None.
        nrows (int): Passed to nrows in ``pd.read_csv``. Used before filtering.
        demo (bool): Use only two months of data. If False, use the whole year.
        return_single_table (bool): Exit the function early and return a dataframe.
        verbose (bool): Show a progress bar while loading the data.

    Examples:

        .. ipython::
            :verbatim:

            In [1]: import featuretools as ft

            In [2]: es = ft.demo.load_flight(verbose=True,
               ...:                          month_filter=[1],
               ...:                          categorical_filter={'origin_city':['Boston, MA']})
            100%|xxxxxxxxxxxxxxxxxxxxxxxxx| 100/100 [01:16<00:00,  1.31it/s]

            In [3]: es
            Out[3]:
            Entityset: Flight Data
              DataFrames:
                airports [Rows: 55, Columns: 3]
                flights [Rows: 613, Columns: 9]
                trip_logs [Rows: 9456, Columns: 22]
                airlines [Rows: 10, Columns: 1]
              Relationships:
                trip_logs.flight_id -> flights.flight_id
                flights.carrier -> airlines.carrier
                flights.dest -> airports.dest
    """

    filename, csv_length = get_flight_filename(demo=demo)

    print("Downloading data ...")
    url = "https://oss.alteryx.com/datasets/{}?library=featuretools&version={}".format(
        filename,
        ft.__version__,
    )

    chunksize = math.ceil(csv_length / 99)
    pd.options.display.max_columns = 200
    iter_csv = pd.read_csv(
        url,
        compression="zip",
        iterator=True,
        nrows=nrows,
        chunksize=chunksize,
    )
    if verbose:
        iter_csv = tqdm(iter_csv, total=100)

    partial_df_list = []
    for chunk in iter_csv:
        df = filter_data(
            _clean_data(chunk),
            month_filter=month_filter,
            categorical_filter=categorical_filter,
        )
        partial_df_list.append(df)
    data = pd.concat(partial_df_list)

    if return_single_table:
        return data

    es = make_es(data)

    return es


def make_es(data):
    es = ft.EntitySet("Flight Data")
    arr_time_columns = [
        "arr_delay",
        "dep_delay",
        "carrier_delay",
        "weather_delay",
        "national_airspace_delay",
        "security_delay",
        "late_aircraft_delay",
        "canceled",
        "diverted",
        "taxi_in",
        "taxi_out",
        "air_time",
        "dep_time",
    ]

    logical_types = {
        "flight_num": Categorical,
        "distance_group": Ordinal(order=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),
        "canceled": Boolean,
        "diverted": Boolean,
    }

    es.add_dataframe(
        data,
        dataframe_name="trip_logs",
        index="trip_log_id",
        make_index=True,
        time_index="date_scheduled",
        secondary_time_index={"arr_time": arr_time_columns},
        logical_types=logical_types,
    )

    es.normalize_dataframe(
        "trip_logs",
        "flights",
        "flight_id",
        additional_columns=[
            "origin",
            "origin_city",
            "origin_state",
            "dest",
            "dest_city",
            "dest_state",
            "distance_group",
            "carrier",
            "flight_num",
        ],
    )

    es.normalize_dataframe("flights", "airlines", "carrier", make_time_index=False)

    es.normalize_dataframe(
        "flights",
        "airports",
        "dest",
        additional_columns=["dest_city", "dest_state"],
        make_time_index=False,
    )
    return es


def _clean_data(data):
    # Make column names snake case
    clean_data = data.rename(columns={col: convert(col) for col in data})

    # Change crs -> "scheduled" and other minor clarifications
    clean_data = clean_data.rename(
        columns={
            "crs_arr_time": "scheduled_arr_time",
            "crs_dep_time": "scheduled_dep_time",
            "crs_elapsed_time": "scheduled_elapsed_time",
            "nas_delay": "national_airspace_delay",
            "origin_city_name": "origin_city",
            "dest_city_name": "dest_city",
            "cancelled": "canceled",
        },
    )

    # Combine strings like 0130 (1:30 AM) with dates (2017-01-01)
    clean_data["scheduled_dep_time"] = clean_data["scheduled_dep_time"].apply(
        lambda x: str(x),
    ) + clean_data["flight_date"].astype("str")

    # Parse combined string as a date
    clean_data.loc[:, "scheduled_dep_time"] = pd.to_datetime(
        clean_data["scheduled_dep_time"],
        format="%H%M%Y-%m-%d",
        errors="coerce",
    )

    clean_data["scheduled_elapsed_time"] = pd.to_timedelta(
        clean_data["scheduled_elapsed_time"],
        unit="m",
    )

    clean_data = _reconstruct_times(clean_data)

    # Create a time index 120 days before scheduled_dep
    clean_data.loc[:, "date_scheduled"] = pd.to_datetime(
        clean_data["scheduled_dep_time"],
    ).dt.date - pd.Timedelta("120d")

    # A null entry for a delay means no delay
    clean_data = _fill_labels(clean_data)

    # Nulls for scheduled values are too problematic. Remove them.
    clean_data = clean_data.dropna(
        axis="rows",
        subset=["scheduled_dep_time", "scheduled_arr_time"],
    )

    # Make a flight id. Define a flight as a combination of:
    # 1. carrier 2. flight number 3. origin airport 4. dest airport
    clean_data.loc[:, "flight_id"] = (
        clean_data["carrier"]
        + "-"
        + clean_data["flight_num"].apply(lambda x: str(x))
        + ":"
        + clean_data["origin"]
        + "->"
        + clean_data["dest"]
    )

    column_order = [
        "flight_id",
        "flight_num",
        "date_scheduled",
        "scheduled_dep_time",
        "scheduled_arr_time",
        "carrier",
        "origin",
        "origin_city",
        "origin_state",
        "dest",
        "dest_city",
        "dest_state",
        "distance_group",
        "dep_time",
        "arr_time",
        "dep_delay",
        "taxi_out",
        "taxi_in",
        "arr_delay",
        "diverted",
        "scheduled_elapsed_time",
        "air_time",
        "distance",
        "carrier_delay",
        "weather_delay",
        "national_airspace_delay",
        "security_delay",
        "late_aircraft_delay",
        "canceled",
    ]

    clean_data = clean_data[column_order]

    return clean_data


def _fill_labels(clean_data):
    labely_columns = [
        "arr_delay",
        "dep_delay",
        "carrier_delay",
        "weather_delay",
        "national_airspace_delay",
        "security_delay",
        "late_aircraft_delay",
        "canceled",
        "diverted",
        "taxi_in",
        "taxi_out",
        "air_time",
    ]
    for col in labely_columns:
        clean_data.loc[:, col] = clean_data[col].fillna(0)

    return clean_data


def _reconstruct_times(clean_data):
    """Reconstruct dep_time, arr_time and scheduled_arr_time
    by adding known delays and durations to known times. We do:
        - dep_time is scheduled_dep_time + dep_delay
        - arr_time is dep_time + taxi_out + air_time + taxi_in
        - scheduled_arr_time is scheduled_dep_time + scheduled_elapsed_time
    """
    clean_data.loc[:, "dep_time"] = clean_data["scheduled_dep_time"] + pd.to_timedelta(
        clean_data["dep_delay"],
        unit="m",
    )

    clean_data.loc[:, "arr_time"] = clean_data["dep_time"] + pd.to_timedelta(
        clean_data["taxi_out"] + clean_data["air_time"] + clean_data["taxi_in"],
        unit="m",
    )

    clean_data.loc[:, "scheduled_arr_time"] = (
        clean_data["scheduled_dep_time"] + clean_data["scheduled_elapsed_time"]
    )
    return clean_data


def filter_data(clean_data, month_filter=None, categorical_filter=None):
    if month_filter is not None:
        tmp = pd.to_datetime(clean_data["scheduled_dep_time"]).dt.month.isin(
            month_filter,
        )
        clean_data = clean_data[tmp]

    if categorical_filter is not None:
        tmp = False
        for key, values in categorical_filter.items():
            tmp = tmp | clean_data[key].isin(values)
        clean_data = clean_data[tmp]

    return clean_data


def convert(name):
    # Convert a CamelCase column name to snake_case
    # Code via SO https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case
    s1 = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub("([a-z0-9])([A-Z])", r"\1_\2", s1).lower()


def get_flight_filename(demo=True):
    if demo:
        filename = SMALL_FLIGHT_CSV
        rows = 860457
    else:
        filename = BIG_FLIGHT_CSV
        rows = 5162742

    return filename, rows


SMALL_FLIGHT_CSV = "data_2017_jan_feb.csv.zip"
BIG_FLIGHT_CSV = "data_all_2017.csv.zip"


================================================
FILE: featuretools/demo/mock_customer.py
================================================
import pandas as pd
from numpy import random
from numpy.random import choice
from woodwork.logical_types import Categorical, PostalCode

import featuretools as ft


def load_mock_customer(
    n_customers=5,
    n_products=5,
    n_sessions=35,
    n_transactions=500,
    random_seed=0,
    return_single_table=False,
    return_entityset=False,
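):
    """Return dataframes of mock customer data.

    Args:
        n_customers (int): Number of customers to generate.
        n_products (int): Number of products to generate.
        n_sessions (int): Number of sessions to generate.
        n_transactions (int): Number of transactions to generate.
        random_seed (int): Seed passed to ``numpy.random.seed`` for reproducibility.
        return_single_table (bool): If True, return a single DataFrame that merges
            transactions, sessions, customers and products.
        return_entityset (bool): If True, return an EntitySet. Ignored if
            ``return_single_table`` is True.

    Returns:
        dict[str, DataFrame], DataFrame or EntitySet: By default, a dictionary of the
        ``customers``, ``sessions``, ``transactions`` and ``products`` dataframes.
    """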

    random.seed(random_seed)
    last_date = pd.to_datetime("12/31/2013")
    first_date = pd.to_datetime("1/1/2008")
    first_bday = pd.to_datetime("1/1/1970")

    join_dates = [
        random.uniform(0, 1) * (last_date - first_date) + first_date
        for _ in range(n_customers)
    ]
    birth_dates = [
        random.uniform(0, 1) * (first_date - first_bday) + first_bday
        for _ in range(n_customers)
    ]

    customers_df = pd.DataFrame({"customer_id": range(1, n_customers + 1)})
    customers_df["zip_code"] = choice(
        ["60091", "13244"],
        n_customers,
    )
    customers_df["join_date"] = pd.Series(join_dates).dt.round("1s")
    customers_df["birthday"] = pd.Series(birth_dates).dt.round("1d")

    products_df = pd.DataFrame({"product_id": pd.Categorical(range(1, n_products + 1))})
    products_df["brand"] = choice(["A", "B", "C"], n_products)

    sessions_df = pd.DataFrame({"session_id": range(1, n_sessions + 1)})
    sessions_df["customer_id"] = choice(customers_df["customer_id"], n_sessions)
    sessions_df["device"] = choice(["desktop", "mobile", "tablet"], n_sessions)

    transactions_df = pd.DataFrame({"transaction_id": range(1, n_transactions + 1)})
    transactions_df["session_id"] = choice(sessions_df["session_id"], n_transactions)
    transactions_df = transactions_df.sort_values("session_id").reset_index(drop=True)
    transactions_df["transaction_time"] = pd.date_range(
        "1/1/2014",
        periods=n_transactions,
        freq="65s",
    )  # todo make these less regular
    transactions_df["product_id"] = pd.Categorical(
        choice(products_df["product_id"], n_transactions),
    )
    transactions_df["amount"] = random.randint(500, 15000, n_transactions) / 100

    # calculate and merge in session start
    # based on the times we came up with for transactions
    session_starts = transactions_df.drop_duplicates("session_id")[
        ["session_id", "transaction_time"]
    ].rename(columns={"transaction_time": "session_start"})
    sessions_df = sessions_df.merge(session_starts)

    if return_single_table:
        return (
            transactions_df.merge(sessions_df)
            .merge(customers_df)
            .merge(products_df)
            .reset_index(drop=True)
        )
    elif return_entityset:
        es = ft.EntitySet(id="transactions")
        es = es.add_dataframe(
            dataframe_name="transactions",
            dataframe=transactions_df,
            index="transaction_id",
            time_index="transaction_time",
            logical_types={"product_id": Categorical},
        )

        es = es.add_dataframe(
            dataframe_name="products",
            dataframe=products_df,
            index="product_id",
        )

        es = es.add_dataframe(
            dataframe_name="sessions",
            dataframe=sessions_df,
            index="session_id",
            time_index="session_start",
        )

        es = es.add_dataframe(
            dataframe_name="customers",
            dataframe=customers_df,
            index="customer_id",
            time_index="join_date",
            logical_types={"zip_code": PostalCode},
        )

        rels = [
            ("products", "product_id", "transactions", "product_id"),
            ("sessions", "session_id", "transactions", "session_id"),
            ("customers", "customer_id", "sessions", "customer_id"),
        ]
        es = es.add_relationships(rels)
        es.add_last_time_indexes()
        return es

    return {
        "customers": customers_df,
        "sessions": sessions_df,
        "transactions": transactions_df,
        "products": products_df,
    }


================================================
FILE: featuretools/demo/retail.py
================================================
import pandas as pd
from woodwork.logical_types import NaturalLanguage

import featuretools as ft


def load_retail(id="demo_retail_data", nrows=None, return_single_table=False):
    """Returns the retail entityset example.
    The original dataset can be found `here <https://archive.ics.uci.edu/ml/datasets/online+retail>`_.

    We have also made some modifications to the data. We
    changed the column names, converted the ``customer_id``
    to a unique fake ``customer_name``, dropped duplicates,
    added columns for ``total`` and ``cancelled`` and
    converted amounts from GBP to USD. You can download the modified CSV in gz `compressed (7 MB)
    <https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv.gz>`_
    or `uncompressed (43 MB)
    <https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv>`_ formats.

    Args:
        id (str):  Id to assign to EntitySet.
        nrows (int):  Number of rows to load of the underlying CSV.
            If None, load all.
        return_single_table (bool): If True, return a DataFrame rather than an EntitySet. Default is False.

    Examples:

        .. ipython::
            :verbatim:

            In [1]: import featuretools as ft

            In [2]: es = ft.demo.load_retail()

            In [3]: es
            Out[3]:
            Entityset: demo_retail_data
              DataFrames:
                orders (shape = [22190, 3])
                products (shape = [3684, 3])
                customers (shape = [4372, 2])
                order_products (shape = [401704, 7])

        Load in subset of data

        .. ipython::
            :verbatim:

            In [4]: es = ft.demo.load_retail(nrows=1000)

            In [5]: es
            Out[5]:
            Entityset: demo_retail_data
              DataFrames:
                orders (shape = [67, 5])
                products (shape = [606, 3])
                customers (shape = [50, 2])
                order_products (shape = [1000, 7])
    """
    es = ft.EntitySet(id)
    csv_s3_gz = (
        "https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv.gz?library=featuretools&version="
        + ft.__version__
    )
    csv_s3 = (
        "https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv?library=featuretools&version="
        + ft.__version__
    )
    # Try to read in gz compressed file
    try:
        df = pd.read_csv(csv_s3_gz, nrows=nrows, parse_dates=["order_date"])
    # Fall back to uncompressed
    except Exception:
        df = pd.read_csv(csv_s3, nrows=nrows, parse_dates=["order_date"])
    if return_single_table:
        return df

    es.add_dataframe(
        dataframe_name="order_products",
        dataframe=df,
        index="order_product_id",
        make_index=True,
        time_index="order_date",
        logical_types={"description": NaturalLanguage},
    )

    es.normalize_dataframe(
        new_dataframe_name="products",
        base_dataframe_name="order_products",
        index="product_id",
        additional_columns=["description"],
    )

    es.normalize_dataframe(
        new_dataframe_name="orders",
        base_dataframe_name="order_products",
        index="order_id",
        additional_columns=["customer_name", "country", "cancelled"],
    )

    es.normalize_dataframe(
        new_dataframe_name="customers",
        base_dataframe_name="orders",
        index="customer_name",
    )
    es.add_last_time_indexes()

    return es


================================================
FILE: featuretools/demo/weather.py
================================================
import pandas as pd

import featuretools as ft


def load_weather(nrows=None, return_single_table=False):
    """
    Load the Australian daily-min-temperatures weather dataset.

    Args:

        nrows (int): Passed to nrows in ``pd.read_csv``.
        return_single_table (bool): Exit the function early and return a dataframe.

    """
    filename = "daily-min-temperatures.csv"
    print("Downloading data ...")
    url = "https://oss.alteryx.com/datasets/{}?library=featuretools&version={}".format(
        filename,
        ft.__version__,
    )
    data = pd.read_csv(url, index_col=None, nrows=nrows)
    if return_single_table:
        return data
    es = make_es(data)
    return es


def make_es(data):
    es = ft.EntitySet("Weather Data")

    es.add_dataframe(
        data,
        dataframe_name="temperatures",
        index="id",
        make_index=True,
        time_index="Date",
    )
    return es


================================================
FILE: featuretools/entityset/__init__.py
================================================
# flake8: noqa
from featuretools.entityset.api import *


================================================
FILE: featuretools/entityset/api.py
================================================
# flake8: noqa
from featuretools.entityset.deserialize import read_entityset
from featuretools.entityset.entityset import EntitySet
from featuretools.entityset.relationship import Relationship
from featuretools.entityset.timedelta import Timedelta


================================================
FILE: featuretools/entityset/deserialize.py
================================================
import json
import os
import tarfile
import tempfile
from inspect import getfullargspec

import pandas as pd
import woodwork.type_sys.type_system as ww_type_system
from woodwork.deserialize import read_woodwork_table

from featuretools.entityset.relationship import Relationship
from featuretools.utils.s3_utils import get_transport_params, use_smartopen_es
from featuretools.utils.schema_utils import check_schema_version
from featuretools.utils.wrangle import _is_local_tar, _is_s3, _is_url


def description_to_entityset(description, **kwargs):
    """Deserialize entityset from data description.

    Args:
        description (dict) : Description of an :class:`.EntitySet`. Likely generated using :meth:`.serialize.entityset_to_description`
        kwargs (keywords): Additional keyword arguments to pass as keyword arguments to the underlying deserialization method.

    Returns:
        entityset (EntitySet) : Instance of :class:`.EntitySet`.
    """
    check_schema_version(description, "entityset")

    from featuretools.entityset import EntitySet

    # If data description was not read from disk, path is None.
    path = description.get("path")
    entityset = EntitySet(description["id"])

    for df in description["dataframes"].values():
        if path is not None:
            data_path = os.path.join(path, "data", df["name"])
            format = description.get("format")
            if format is not None:
                kwargs["format"] = format
                if format == "parquet" and df["loading_info"]["table_type"] == "pandas":
                    kwargs["filename"] = df["name"] + ".parquet"
            dataframe = read_woodwork_table(data_path, validate=False, **kwargs)
        else:
            dataframe = empty_dataframe(df)

        entityset.add_dataframe(dataframe)

    for relationship in description["relationships"]:
        rel = Relationship.from_dictionary(relationship, entityset)
        entityset.add_relationship(relationship=rel)

    return entityset


def empty_dataframe(description):
    """Deserialize empty dataframe from dataframe description.

    Args:
        description (dict) : Description of dataframe.

    Returns:
        df (DataFrame) : Empty dataframe with Woodwork initialized.
    """
    # TODO: Can we update Woodwork to return an empty initialized dataframe from a description
    # instead of using this function? Or otherwise eliminate? Issue #1476
    logical_types = {}
    semantic_tags = {}
    column_descriptions = {}
    column_metadata = {}
    use_standard_tags = {}
    category_dtypes = {}
    columns = []
    for col in description["column_typing_info"]:
        col_name = col["name"]
        columns.append(col_name)

        ltype_metadata = col["logical_type"]
        ltype = ww_type_system.str_to_logical_type(
            ltype_metadata["type"],
            params=ltype_metadata["parameters"],
        )

        tags = col["semantic_tags"]

        if "index" in tags:
            tags.remove("index")
        elif "time_index" in tags:
            tags.remove("time_index")

        logical_types[col_name] = ltype
        semantic_tags[col_name] = tags
        column_descriptions[col_name] = col["description"]
        column_metadata[col_name] = col["metadata"]
        use_standard_tags[col_name] = col["use_standard_tags"]

        if col["physical_type"]["type"] == "category":
            # Make sure categories are recreated properly
            cat_values = col["physical_type"]["cat_values"]
            cat_dtype = col["physical_type"]["cat_dtype"]
            cat_object = pd.CategoricalDtype(pd.Index(cat_values, dtype=cat_dtype))
            category_dtypes[col_name] = cat_object

    dataframe = pd.DataFrame(columns=columns).astype(category_dtypes)

    dataframe.ww.init(
        name=description.get("name"),
        index=description.get("index"),
        time_index=description.get("time_index"),
        logical_types=logical_types,
        semantic_tags=semantic_tags,
        use_standard_tags=use_standard_tags,
        table_metadata=description.get("table_metadata"),
        column_metadata=column_metadata,
        column_descriptions=column_descriptions,
        validate=False,
    )

    return dataframe


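def read_data_description(path):
    """Read data description from a directory on disk.

    Remote sources (S3 paths or URLs) and tar archives must first be downloaded
    and extracted to a local directory, as done in :func:`read_entityset`.

    Args:
        path (str): Directory on disk that contains `data_description.json`.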

    Returns:
        description (dict) : Description of :class:`.EntitySet`.
    """

    path = os.path.abspath(path)
    assert os.path.exists(path), '"{}" does not exist'.format(path)
    filepath = os.path.join(path, "data_description.json")
    with open(filepath, "r") as file:
        description = json.load(file)
    description["path"] = path
    return description


def read_entityset(path, profile_name=None, **kwargs):
    """Read entityset from disk, S3 path, or URL.

    NOTE: Never attempt to read an archived EntitySet from an untrusted source.

    Args:
        path (str): Directory on disk, S3 path, or URL to read `data_description.json`.
        profile_name (str, bool): The AWS profile used to read from S3. Will default to None and search for AWS credentials.
            Set to False to use an anonymous profile.
        kwargs (keywords): Additional keyword arguments to pass as keyword arguments to the underlying deserialization method.

    Returns:
        entityset (EntitySet) : Instance of :class:`.EntitySet`.
    """
    if _is_url(path) or _is_s3(path) or _is_local_tar(str(path)):
        with tempfile.TemporaryDirectory() as tmpdir:
            local_path = path
            transport_params = None

            if _is_s3(path):
                transport_params = get_transport_params(profile_name)

            if _is_s3(path) or _is_url(path):
                local_path = os.path.join(tmpdir, "temporary_es")
                use_smartopen_es(local_path, path, transport_params)

            with tarfile.open(str(local_path)) as tar:
                if "filter" in getfullargspec(tar.extractall).kwonlyargs:
                    tar.extractall(path=tmpdir, filter="data")
                else:
                    raise RuntimeError(
                        "Please upgrade your Python version to the latest patch release to allow for safe extraction of the EntitySet archive.",
                    )

            data_description = read_data_description(tmpdir)
            return description_to_entityset(data_description, **kwargs)
    else:
        data_description = read_data_description(path)
        return description_to_entityset(data_description, **kwargs)


================================================
FILE: featuretools/entityset/entityset.py
================================================
import copy
import logging
import warnings
from collections import defaultdict

import numpy as np
import pandas as pd
from woodwork import init_series
from woodwork.logical_types import Datetime, LatLong

from featuretools.entityset import deserialize, serialize
from featuretools.entityset.relationship import Relationship, RelationshipPath
from featuretools.feature_base.feature_base import _ES_REF
from featuretools.utils.plot_utils import (
    check_graphviz,
    get_graphviz_format,
    save_graph,
)
from featuretools.utils.wrangle import _check_timedelta

pd.options.mode.chained_assignment = None  # default='warn'
logger = logging.getLogger("featuretools.entityset")

LTI_COLUMN_NAME = "_ft_last_time"
WW_SCHEMA_KEY = "_ww__getstate__schemas"


class EntitySet(object):
    """
    Stores all actual data and typing information for an entityset

    Attributes:
        id
        dataframe_dict
        relationships
        time_type

    Properties:
        metadata

    """

    def __init__(self, id=None, dataframes=None, relationships=None):
        """Creates EntitySet

        Args:
            id (str) : Unique identifier to associate with this instance
            dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]):
                Dictionary of DataFrames. Entries take the format
                {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}.
                Note that only the dataframe is required. If a Woodwork DataFrame is supplied, any other parameters
                will be ignored.
            relationships (list[(str, str, str, str)]): List of relationships
                between dataframes. List items are a tuple with the format
                (parent dataframe name, parent column, child dataframe name, child column).

        Example:

            .. code-block:: python

                dataframes = {
                    "cards" : (card_df, "id"),
                    "transactions" : (transactions_df, "id", "transaction_time")
                }

                relationships = [("cards", "id", "transactions", "card_id")]

                ft.EntitySet("my-entity-set", dataframes, relationships)
        """
        self.id = id
        self.dataframe_dict = {}
        self.relationships = []
        self.time_type = None

        dataframes = dataframes or {}
        relationships = relationships or []
        for df_name in dataframes:
            df = dataframes[df_name][0]
            if df.ww.schema is not None and df.ww.name != df_name:
                raise ValueError(
                    f"Naming conflict in dataframes dictionary: dictionary key '{df_name}' does not match dataframe name '{df.ww.name}'",
                )

            index_column = None
            time_index = None
            make_index = False
            semantic_tags = None
            logical_types = None
            if len(dataframes[df_name]) > 1:
                index_column = dataframes[df_name][1]
            if len(dataframes[df_name]) > 2:
                time_index = dataframes[df_name][2]
            if len(dataframes[df_name]) > 3:
                logical_types = dataframes[df_name][3]
            if len(dataframes[df_name]) > 4:
                semantic_tags = dataframes[df_name][4]
            if len(dataframes[df_name]) > 5:
                make_index = dataframes[df_name][5]
            self.add_dataframe(
                dataframe_name=df_name,
                dataframe=df,
                index=index_column,
                time_index=time_index,
                logical_types=logical_types,
                semantic_tags=semantic_tags,
                make_index=make_index,
            )

        for relationship in relationships:
            parent_df, parent_column, child_df, child_column = relationship
            self.add_relationship(parent_df, parent_column, child_df, child_column)

        self.reset_data_description()
        _ES_REF[self.id] = self

    def __sizeof__(self):
        return sum([df.__sizeof__() for df in self.dataframes])

    def __dask_tokenize__(self):
        return (EntitySet, serialize.entityset_to_description(self.metadata))

    def __eq__(self, other, deep=False):
        if self.id != other.id:
            return False
        if self.time_type != other.time_type:
            return False
        if len(self.dataframe_dict) != len(other.dataframe_dict):
            return False
        for df_name, df in self.dataframe_dict.items():
            if df_name not in other.dataframe_dict:
                return False
            if not df.ww.__eq__(other[df_name].ww, deep=deep):
                return False
        if not len(self.relationships) == len(other.relationships):
            return False
        for r in self.relationships:
            if r not in other.relationships:
                return False
        return True

    def __ne__(self, other, deep=False):
        return not self.__eq__(other, deep=deep)

    def __getitem__(self, dataframe_name):
        """Get dataframe instance from entityset

        Args:
            dataframe_name (str): Name of dataframe.

        Returns:
            :class:`.DataFrame` : Instance of dataframe with Woodwork typing information.

        Raises:
            KeyError: If the dataframe doesn't exist on the entityset.
        """
        if dataframe_name in self.dataframe_dict:
            return self.dataframe_dict[dataframe_name]
        name = self.id or "entity set"
        raise KeyError("DataFrame %s does not exist in %s" % (dataframe_name, name))

    def __deepcopy__(self, memo):
        cls = self.__class__
        result = cls.__new__(cls)
        memo[id(self)] = result
        for k, v in self.__dict__.items():
            if k == "dataframe_dict":
                # Copy the DataFrames, retaining Woodwork typing information
                copied_attr = copy.copy(v)
                for df_name, df in copied_attr.items():
                    copied_attr[df_name] = df.ww.copy()
            else:
                copied_attr = copy.deepcopy(v, memo)

            setattr(result, k, copied_attr)

        for df in result.dataframe_dict.values():
            result._add_references_to_metadata(df)
        return result

    @property
    def dataframes(self):
        return list(self.dataframe_dict.values())

    @property
    def metadata(self):
        """Returns the metadata for this EntitySet. The metadata will be recomputed if it does not exist."""
        if self._data_description is None:
            description = serialize.entityset_to_description(self)
            self._data_description = deserialize.description_to_entityset(description)

        return self._data_description

    def reset_data_description(self):
        self._data_description = None

    def to_pickle(self, path, compression=None, profile_name=None):
        """Write entityset in the pickle format, location specified by `path`.
        Path could be a local path or an S3 path.
        If writing to S3, a tar archive of files will be written.

        Args:
            path (str): location on disk to write to (will be created as a directory)
            compression (str) : Name of the compression to use. Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.
            profile_name (str) : Name of AWS profile to use, False to use an anonymous profile, or None.
        """
        serialize.write_data_description(
            self,
            path,
            format="pickle",
            compression=compression,
            profile_name=profile_name,
        )
        return self

    def to_parquet(self, path, engine="auto", compression=None, profile_name=None):
        """Write entityset to disk in the parquet format, location specified by `path`.
        Path could be a local path or an S3 path.
        If writing to S3, a tar archive of files will be written.

        Args:
            path (str): location on disk to write to (will be created as a directory)
            engine (str) : Name of the engine to use. Possible values are: {'auto', 'pyarrow'}.
            compression (str) : Name of the compression to use. Possible values are: {'snappy', 'gzip', 'brotli', None}.
            profile_name (str) : Name of AWS profile to use, False to use an anonymous profile, or None.
        """
        serialize.write_data_description(
            self,
            path,
            format="parquet",
            engine=engine,
            compression=compression,
            profile_name=profile_name,
        )
        return self

    def to_csv(
        self,
        path,
        sep=",",
        encoding="utf-8",
        engine="python",
        compression=None,
        profile_name=None,
    ):
        """Write entityset to disk in the csv format, location specified by `path`.
        Path could be a local path or an S3 path.
        If writing to S3, a tar archive of files will be written.

        Args:
            path (str) : Location on disk to write to (will be created as a directory)
            sep (str) : String of length 1. Field delimiter for the output file.
            encoding (str) : A string representing the encoding to use in the output file, defaults to 'utf-8'.
            engine (str) : Name of the engine to use. Possible values are: {'c', 'python'}.
            compression (str) : Name of the compression to use. Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.
            profile_name (str) : Name of AWS profile to use, False to use an anonymous profile, or None.
        """
        serialize.write_data_description(
            self,
            path,
            format="csv",
            index=False,
            sep=sep,
            encoding=encoding,
            engine=engine,
            compression=compression,
            profile_name=profile_name,
        )
        return self

    def to_dictionary(self):
        return serialize.entityset_to_description(self)

    ###########################################################################
    #   Public getter/setter methods  #########################################
    ###########################################################################

    def __repr__(self):
        repr_out = "Entityset: {}\n".format(self.id)
        repr_out += "  DataFrames:"
        for df in self.dataframes:
            if df.shape:
                repr_out += "\n    {} [Rows: {}, Columns: {}]".format(
                    df.ww.name,
                    df.shape[0],
                    df.shape[1],
                )
            else:
                repr_out += "\n    {} [Rows: None, Columns: None]".format(df.ww.name)
        repr_out += "\n  Relationships:"

        if len(self.relationships) == 0:
            repr_out += "\n    No relationships"

        for r in self.relationships:
            repr_out += "\n    %s.%s -> %s.%s" % (
                r._child_dataframe_name,
                r._child_column_name,
                r._parent_dataframe_name,
                r._parent_column_name,
            )

        return repr_out

    def add_relationships(self, relationships):
        """Add multiple new relationships to an entityset

        Args:
            relationships (list[tuple(str, str, str, str)] or list[Relationship]) : List of
                new relationships to add. Relationships are specified either as a :class:`.Relationship`
                object or a four element tuple identifying the parent and child columns:
                (parent_dataframe_name, parent_column_name, child_dataframe_name, child_column_name)
        """
        for rel in relationships:
            if isinstance(rel, Relationship):
                self.add_relationship(relationship=rel)
            else:
                self.add_relationship(*rel)
        return self

    def add_relationship(
        self,
        parent_dataframe_name=None,
        parent_column_name=None,
        child_dataframe_name=None,
        child_column_name=None,
        relationship=None,
    ):
        """Add a new relationship between dataframes in the entityset. Relationships can be specified
        by passing dataframe and column names or by passing a :class:`.Relationship` object.

        Args:
            parent_dataframe_name (str): Name of the parent dataframe in the EntitySet. Must be specified
                if relationship is not.
            parent_column_name (str): Name of the parent column. Must be specified if relationship is not.
            child_dataframe_name (str): Name of the child dataframe in the EntitySet. Must be specified
                if relationship is not.
            child_column_name (str): Name of the child column. Must be specified if relationship is not.
            relationship (Relationship): Instance of new relationship to be added. Must be specified
                if dataframe and column names are not supplied.
        """
        if relationship and (
            parent_dataframe_name
            or parent_column_name
            or child_dataframe_name
            or child_column_name
        ):
            raise ValueError(
                "Cannot specify dataframe and column name values and also supply a Relationship",
            )

        if not relationship:
            relationship = Relationship(
                self,
                parent_dataframe_name,
                parent_column_name,
                child_dataframe_name,
                child_column_name,
            )
        if relationship in self.relationships:
            warnings.warn("Not adding duplicate relationship: " + str(relationship))
            return self

        # _operations?

        # this is a new pair of dataframes
        child_df = relationship.child_dataframe
        child_column = relationship._child_column_name
        if child_df.ww.index == child_column:
            msg = "Unable to add relationship because child column '{}' in '{}' is also its index"
            raise ValueError(msg.format(child_column, child_df.ww.name))
        parent_df = relationship.parent_dataframe
        parent_column = relationship._parent_column_name

        if parent_df.ww.index != parent_column:
            parent_df.ww.set_index(parent_column)

        # Empty dataframes (as a result of accessing metadata)
        # default to object dtypes for categorical columns, but
        # indexes/foreign keys default to ints. In this case, we convert
        # the empty column's type to int
        if (
            child_df.empty
            and child_df[child_column].dtype == object
            and parent_df.ww.columns[parent_column].is_numeric
        ):
            child_df.ww[child_column] = pd.Series(name=child_column, dtype=np.int64)

        parent_ltype = parent_df.ww.logical_types[parent_column]
        child_ltype = child_df.ww.logical_types[child_column]
        if parent_ltype != child_ltype:
            difference_msg = ""
            if str(parent_ltype) == str(child_ltype):
                difference_msg = "There is a conflict between the parameters. "

            warnings.warn(
                f"Logical type {child_ltype} for child column {child_column} does not match "
                f"parent column {parent_column} logical type {parent_ltype}. {difference_msg}"
                "Changing child logical type to match parent.",
            )
            child_df.ww.set_types(logical_types={child_column: parent_ltype})

        if "foreign_key" not in child_df.ww.semantic_tags[child_column]:
            child_df.ww.add_semantic_tags({child_column: "foreign_key"})

        self.relationships.append(relationship)
        self.reset_data_description()
        return self

    def set_secondary_time_index(self, dataframe_name, secondary_time_index):
        """
        Set the secondary time index for a dataframe in the EntitySet using its dataframe name.

        Args:
            dataframe_name (str) : name of the dataframe for which to set the secondary time index.
            secondary_time_index (dict[str -> list[str]]): Name of column containing time data to
                be used as a secondary time index mapped to a list of the columns in the dataframe
                associated with that secondary time index.
        """
        dataframe = self[dataframe_name]
        self._set_secondary_time_index(dataframe, secondary_time_index)

    def _set_secondary_time_index(self, dataframe, secondary_time_index):
        """Sets the secondary time index for a Woodwork dataframe passed in"""
        assert (
            dataframe.ww.schema is not None
        ), "Cannot set secondary time index if Woodwork is not initialized"
        self._check_secondary_time_index(dataframe, secondary_time_index)
        if secondary_time_index is not None:
            dataframe.ww.metadata["secondary_time_index"] = secondary_time_index

    ###########################################################################
    #   Relationship access/helper methods  ###################################
    ###########################################################################

    def find_forward_paths(self, start_dataframe_name, goal_dataframe_name):
        """
        Generator which yields all forward paths between a start and goal
        dataframe. Does not include paths which contain cycles.

        Args:
            start_dataframe_name (str) : name of dataframe to start the search from
            goal_dataframe_name  (str) : name of dataframe to find forward path to

        See Also:
            :func:`EntitySet.find_backward_paths`
        """
        for sub_dataframe_name, path in self._forward_dataframe_paths(
            start_dataframe_name,
        ):
            if sub_dataframe_name == goal_dataframe_name:
                yield path

    def find_backward_paths(self, start_dataframe_name, goal_dataframe_name):
        """
        Generator which yields all backward paths between a start and goal
        dataframe. Does not include paths which contain cycles.

        Args:
            start_dataframe_name (str) : Name of dataframe to start the search from.
            goal_dataframe_name  (str) : Name of dataframe to find backward path to.

        See Also:
            :func:`EntitySet.find_forward_paths`
        """
        for path in self.find_forward_paths(goal_dataframe_name, start_dataframe_name):
            # Reverse path
            yield path[::-1]

    def _forward_dataframe_paths(self, start_dataframe_name, seen_dataframes=None):
        """
        Generator which yields the names of all dataframes connected through forward
        relationships, and the path taken to each. A dataframe will be yielded
        multiple times if there are multiple paths to it.

        Implemented using depth first search.
        """
        if seen_dataframes is None:
            seen_dataframes = set()

        if start_dataframe_name in seen_dataframes:
            return

        seen_dataframes.add(start_dataframe_name)

        yield start_dataframe_name, []

        for relationship in self.get_forward_relationships(start_dataframe_name):
            next_dataframe = relationship._parent_dataframe_name
            # Copy seen dataframes for each next node to allow multiple paths (but
            # not cycles).
            descendants = self._forward_dataframe_paths(
                next_dataframe,
                seen_dataframes.copy(),
            )
            for sub_dataframe_name, sub_path in descendants:
                yield sub_dataframe_name, [relationship] + sub_path

    def get_forward_dataframes(self, dataframe_name, deep=False):
        """
        Get dataframes that are in a forward relationship with dataframe

        Args:
            dataframe_name (str): Name of dataframe to search from.
            deep (bool): if True, recursively find forward dataframes.

        Yields a tuple of (descendant_name, path from dataframe_name to descendant).
        """
        for relationship in self.get_forward_relationships(dataframe_name):
            parent_dataframe_name = relationship._parent_dataframe_name
            direct_path = RelationshipPath([(True, relationship)])
            yield parent_dataframe_name, direct_path

            if deep:
                sub_dataframes = self.get_forward_dataframes(
                    parent_dataframe_name,
                    deep=True,
                )
                for sub_dataframe_name, path in sub_dataframes:
                    yield sub_dataframe_name, direct_path + path

    def get_backward_dataframes(self, dataframe_name, deep=False):
        """
        Get dataframes that are in a backward relationship with dataframe

        Args:
            dataframe_name (str): Name of dataframe to search from.
            deep (bool): if True, recursively find backward dataframes.

        Yields a tuple of (descendant_name, path from dataframe_name to descendant).
        """
        for relationship in self.get_backward_relationships(dataframe_name):
            child_dataframe_name = relationship._child_dataframe_name
            direct_path = RelationshipPath([(False, relationship)])
            yield child_dataframe_name, direct_path

            if deep:
                sub_dataframes = self.get_backward_dataframes(
                    child_dataframe_name,
                    deep=True,
                )
                for sub_dataframe_name, path in sub_dataframes:
                    yield sub_dataframe_name, direct_path + path

    def get_forward_relationships(self, dataframe_name):
        """Get relationships where dataframe "dataframe_name" is the child

        Args:
            dataframe_name (str): Name of dataframe to get relationships for.

        Returns:
            list[:class:`.Relationship`]: List of forward relationships.
        """
        return [
            r for r in self.relationships if r._child_dataframe_name == dataframe_name
        ]

    def get_backward_relationships(self, dataframe_name):
        """
        Get relationships where dataframe "dataframe_name" is the parent.

        Args:
            dataframe_name (str): Name of dataframe to get relationships for.

        Returns:
            list[:class:`.Relationship`]: List of backward relationships.
        """
        return [
            r for r in self.relationships if r._parent_dataframe_name == dataframe_name
        ]

    def has_unique_forward_path(self, start_dataframe_name, end_dataframe_name):
        """
        Is the forward path from start to end unique?

        This will raise StopIteration if there is no such path.
        """
        paths = self.find_forward_paths(start_dataframe_name, end_dataframe_name)

        next(paths)
        second_path = next(paths, None)

        return not second_path

    ###########################################################################
    #  DataFrame creation methods  ##############################################
    ###########################################################################

    def add_dataframe(
        self,
        dataframe,
        dataframe_name=None,
        index=None,
        logical_types=None,
        semantic_tags=None,
        make_index=False,
        time_index=None,
        secondary_time_index=None,
        already_sorted=False,
    ):
        """
        Add a DataFrame to the EntitySet with Woodwork typing information.

        Args:
            dataframe (pandas.DataFrame) : Dataframe containing the data.

            dataframe_name (str, optional) : Unique name to associate with this dataframe. Must be
                provided if Woodwork is not initialized on the input DataFrame.

            index (str, optional): Name of the column used to index the dataframe.
                Must be unique. If None, take the first column.

            logical_types (dict[str -> Woodwork.LogicalTypes/str], optional):
                Keys are column names and values are logical types. Will be inferred if not specified.

            semantic_tags (dict[str -> str/set], optional):
                Keys are column names and values are semantic tags.

            make_index (bool, optional) : If True, assume index does not
                exist as a column in dataframe, and create a new column of that name
                using integers. Otherwise, assume index exists.

            time_index (str, optional): Name of the column containing
                time data. Type must be numeric or datetime in nature.

            secondary_time_index (dict[str -> list[str]]): Name of column containing time data to
                be used as a secondary time index mapped to a list of the columns in the dataframe
                associated with that secondary time index.

            already_sorted (bool, optional) : If True, assumes that input dataframe
                is already sorted by time. Defaults to False.

        Notes:

            Will infer logical types from the data.

        Example:
            .. ipython:: python

                import featuretools as ft
                import pandas as pd
                transactions_df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                                                "session_id": [1, 2, 1, 3, 4, 5],
                                                "amount": [100.40, 20.63, 33.32, 13.12, 67.22, 1.00],
                                                "transaction_time": pd.date_range(start="10:00", periods=6, freq="10s"),
                                                "fraud": [True, False, True, False, True, True]})
                es = ft.EntitySet("example")
                es.add_dataframe(dataframe_name="transactions",
                                 index="id",
                                 time_index="transaction_time",
                                 dataframe=transactions_df)

                es["transactions"]

        """
        logical_types = logical_types or {}
        semantic_tags = semantic_tags or {}

        if len(self.dataframes) > 0:
            if not isinstance(dataframe, type(self.dataframes[0])):
                raise ValueError(
                    "All dataframes must be of the same type. "
                    "Cannot add dataframe of type {} to an entityset with existing dataframes "
                    "of type {}".format(type(dataframe), type(self.dataframes[0])),
                )

        # Only allow string column names
        non_string_names = [
            name for name in dataframe.columns if not isinstance(name, str)
        ]
        if non_string_names:
            raise ValueError(
                "All column names must be strings (Columns {} "
                "are not strings)".format(non_string_names),
            )

        if dataframe.ww.schema is None:
            if dataframe_name is None:
                raise ValueError(
                    "Cannot add dataframe to EntitySet without a name. "
                    "Please provide a value for the dataframe_name parameter.",
                )

            index_was_created, index, dataframe = _get_or_create_index(
                index,
                make_index,
                dataframe,
            )

            dataframe.ww.init(
                name=dataframe_name,
                index=index,
                time_index=time_index,
                logical_types=logical_types,
                semantic_tags=semantic_tags,
                already_sorted=already_sorted,
            )
            if index_was_created:
                dataframe.ww.metadata["created_index"] = index

        else:
            if dataframe.ww.name is None:
                raise ValueError(
                    "Cannot add a Woodwork DataFrame to EntitySet without a name",
                )
            if dataframe.ww.index is None:
                raise ValueError(
                    "Cannot add Woodwork DataFrame to EntitySet without index",
                )

            extra_params = []
            if index is not None:
                extra_params.append("index")
            if time_index is not None:
                extra_params.append("time_index")
            if logical_types:
                extra_params.append("logical_types")
            if make_index:
                extra_params.append("make_index")
            if semantic_tags:
                extra_params.append("semantic_tags")
            if already_sorted:
                extra_params.append("already_sorted")
            if dataframe_name is not None and dataframe_name != dataframe.ww.name:
                extra_params.append("dataframe_name")
            if extra_params:
                warnings.warn(
                    "A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: "
                    + ", ".join(extra_params),
                )

        if dataframe.ww.time_index is not None:
            self._check_uniform_time_index(dataframe)
            self._check_secondary_time_index(dataframe)

        if secondary_time_index:
            self._set_secondary_time_index(
                dataframe,
                secondary_time_index=secondary_time_index,
            )

        dataframe = self._normalize_values(dataframe)

        self.dataframe_dict[dataframe.ww.name] = dataframe
        self.reset_data_description()
        self._add_references_to_metadata(dataframe)

        return self

    def __setitem__(self, key, value):
        self.add_dataframe(dataframe=value, dataframe_name=key)

    def normalize_dataframe(
        self,
        base_dataframe_name,
        new_dataframe_name,
        index,
        additional_columns=None,
        copy_columns=None,
        make_time_index=None,
        make_secondary_time_index=None,
        new_dataframe_time_index=None,
        new_dataframe_secondary_time_index=None,
    ):
        """Create a new dataframe and relationship from unique values of an existing column.

        Args:
            base_dataframe_name (str) : Dataframe name from which to split.

            new_dataframe_name (str): Name of the new dataframe.

            index (str): Column in base dataframe
                that will become index of new dataframe. Relationship
                will be created across this column.

            additional_columns (list[str]):
                List of column names to remove from
                base_dataframe and move to new dataframe.

            copy_columns (list[str]): List of
                column names to copy from base dataframe
                to new dataframe.

            make_time_index (bool or str, optional): Create time index for new dataframe based
                on time index in base_dataframe, optionally specifying which column in base_dataframe
                to use for time_index. If specified as True without a specific column name,
                uses the primary time index. Defaults to True if base dataframe has a time index.

            make_secondary_time_index (dict[str -> list[str]], optional): Create a secondary time index
                from key. Values of dictionary are the columns to associate with a secondary time index.
                Only one secondary time index is allowed. If None, only associate the time index.

            new_dataframe_time_index (str, optional): Rename new dataframe time index.

            new_dataframe_secondary_time_index (str, optional): Rename new dataframe secondary time index.

        """
        base_dataframe = self.dataframe_dict[base_dataframe_name]
        additional_columns = additional_columns or []
        copy_columns = copy_columns or []

        for list_name, col_list in {
            "copy_columns": copy_columns,
            "additional_columns": additional_columns,
        }.items():
            if not isinstance(col_list, list):
                raise TypeError(
                    "'{}' must be a list, but received type {}".format(
                        list_name,
                        type(col_list),
                    ),
                )
            if len(col_list) != len(set(col_list)):
                raise ValueError(
                    f"'{list_name}' contains duplicate columns. All columns must be unique.",
                )
            for col_name in col_list:
                if col_name == index:
                    raise ValueError(
                        "Not adding {} as both index and column in {}".format(
                            col_name,
                            list_name,
                        ),
                    )

        for col in additional_columns:
            if col == base_dataframe.ww.time_index:
                raise ValueError(
                    "Not moving {} as it is the base time index column. Perhaps, move the column to the copy_columns.".format(
                        col,
                    ),
                )

        if isinstance(make_time_index, str):
            if make_time_index not in base_dataframe.columns:
                raise ValueError(
                    "'make_time_index' must be a column in the base dataframe",
                )
            elif make_time_index not in additional_columns + copy_columns:
                raise ValueError(
                    "'make_time_index' must be specified in 'additional_columns' or 'copy_columns'",
                )
        if index == base_dataframe.ww.index:
            raise ValueError(
                "'index' must be different from the index column of the base dataframe",
            )

        transfer_types = {}
        # Types will be a tuple of (logical_type, semantic_tags, column_metadata, column_description)
        transfer_types[index] = (
            base_dataframe.ww.logical_types[index],
            base_dataframe.ww.semantic_tags[index],
            base_dataframe.ww.columns[index].metadata,
            base_dataframe.ww.columns[index].description,
        )
        for col_name in additional_columns + copy_columns:
            # Remove any existing time index tags
            transfer_types[col_name] = (
                base_dataframe.ww.logical_types[col_name],
                (base_dataframe.ww.semantic_tags[col_name] - {"time_index"}),
                base_dataframe.ww.columns[col_name].metadata,
                base_dataframe.ww.columns[col_name].description,
            )

        # create and add new dataframe
        new_dataframe = self[base_dataframe_name].copy()

        if make_time_index is None and base_dataframe.ww.time_index is not None:
            make_time_index = True

        if isinstance(make_time_index, str):
            # Set the new time index to make_time_index.
            base_time_index = make_time_index
            new_dataframe_time_index = make_time_index
            already_sorted = new_dataframe_time_index == base_dataframe.ww.time_index
        elif make_time_index:
            # Create a new time index based on the base dataframe time index.
            base_time_index = base_dataframe.ww.time_index
            if new_dataframe_time_index is None:
                new_dataframe_time_index = "first_%s_time" % (base_dataframe.ww.name)

            already_sorted = True

            assert (
                base_dataframe.ww.time_index is not None
            ), "Base dataframe doesn't have time_index defined"

            if base_time_index not in copy_columns:
                copy_columns.append(base_time_index)

                time_index_types = (
                    base_dataframe.ww.logical_types[base_dataframe.ww.time_index],
                    base_dataframe.ww.semantic_tags[base_dataframe.ww.time_index],
                    base_dataframe.ww.columns[base_dataframe.ww.time_index].metadata,
                    base_dataframe.ww.columns[base_dataframe.ww.time_index].description,
                )
            else:
                # If base_time_index is in copy_columns then we've already added the transfer types
                # but since we're changing the name, we have to remove it
                time_index_types = transfer_types[base_dataframe.ww.time_index]
                del transfer_types[base_dataframe.ww.time_index]

            transfer_types[new_dataframe_time_index] = time_index_types

        else:
            new_dataframe_time_index = None
            already_sorted = False

        if new_dataframe_time_index is not None and new_dataframe_time_index == index:
            raise ValueError(
                "time_index and index cannot be the same value, %s"
                % (new_dataframe_time_index),
            )

        selected_columns = [index] + additional_columns + copy_columns

        new_dataframe = new_dataframe.dropna(subset=[index])
        new_dataframe2 = new_dataframe.drop_duplicates(index, keep="first")[
            selected_columns
        ]

        if make_time_index:
            new_dataframe2 = new_dataframe2.rename(
                columns={base_time_index: new_dataframe_time_index},
            )
        if make_secondary_time_index:
            assert (
                len(make_secondary_time_index) == 1
            ), "Can only provide 1 secondary time index"
            secondary_time_index = list(make_secondary_time_index.keys())[0]

            secondary_columns = [index, secondary_time_index] + list(
                make_secondary_time_index.values(),
            )[0]
            secondary_df = new_dataframe.drop_duplicates(index, keep="last")[
                secondary_columns
            ]
            if new_dataframe_secondary_time_index:
                secondary_df = secondary_df.rename(
                    columns={secondary_time_index: new_dataframe_secondary_time_index},
                )
                secondary_time_index = new_dataframe_secondary_time_index
            else:
                new_dataframe_secondary_time_index = secondary_time_index
            secondary_df = secondary_df.set_index(index)
            new_dataframe = new_dataframe2.join(secondary_df, on=index)
        else:
            new_dataframe = new_dataframe2

        base_dataframe_index = index

        if make_secondary_time_index:
            old_ti_name = list(make_secondary_time_index.keys())[0]
            ti_cols = list(make_secondary_time_index.values())[0]
            ti_cols = [c if c != old_ti_name else secondary_time_index for c in ti_cols]
            make_secondary_time_index = {secondary_time_index: ti_cols}

        # will initialize Woodwork on this DataFrame
        logical_types = {}
        semantic_tags = {}
        column_metadata = {}
        column_descriptions = {}
        for col_name, (ltype, tags, metadata, description) in transfer_types.items():
            logical_types[col_name] = ltype
            semantic_tags[col_name] = tags - {"time_index"}
            column_metadata[col_name] = copy.deepcopy(metadata)
            column_descriptions[col_name] = description

        new_dataframe.ww.init(
            name=new_dataframe_name,
            index=index,
            already_sorted=already_sorted,
            time_index=new_dataframe_time_index,
            logical_types=logical_types,
            semantic_tags=semantic_tags,
            column_metadata=column_metadata,
            column_descriptions=column_descriptions,
        )

        self.add_dataframe(
            new_dataframe,
            secondary_time_index=make_secondary_time_index,
        )

        self.dataframe_dict[base_dataframe_name] = self.dataframe_dict[
            base_dataframe_name
        ].ww.drop(additional_columns)

        self.dataframe_dict[base_dataframe_name].ww.add_semantic_tags(
            {base_dataframe_index: "foreign_key"},
        )

        self.add_relationship(
            new_dataframe_name,
            index,
            base_dataframe_name,
            base_dataframe_index,
        )
        self.reset_data_description()
        return self

    # ###########################################################################
    # #  Data wrangling methods  ###############################################
    # ###########################################################################

    def concat(self, other, inplace=False):
        """Combine entityset with another to create a new entityset with the
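        combined data of both entitysets.

        Args:
            other (EntitySet): EntitySet to combine with this one. It must have
                the same dataframes, relationships, and column names.
            inplace (bool): If True, update this EntitySet with the combined
                data. If False (the default), combine into a deep copy.

        Returns:
            EntitySet: The EntitySet containing the combined data.
        """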
        if not self.__eq__(other):
            raise ValueError(
                "Entitysets must have the same dataframes, relationships"
                ", and column names",
            )

        if inplace:
            combined_es = self
        else:
            combined_es = copy.deepcopy(self)

        has_last_time_index = []
        for df in self.dataframes:
            self_df = df
            other_df = other[df.ww.name]
            combined_df = pd.concat([self_df, other_df])
            # If either DataFrame has a created index, there will likely
            # be overlap in the index column, so we use the other values
            if self_df.ww.metadata.get("created_index") or other_df.ww.metadata.get(
                "created_index",
            ):
                columns = [
                    col
                    for col in combined_df.columns
                    if col != df.ww.index and col != df.ww.time_index
                ]
            else:
                columns = [df.ww.index]
            combined_df.drop_duplicates(columns, inplace=True)

            self_lti_col = df.ww.metadata.get("last_time_index")
            other_lti_col = other[df.ww.name].ww.metadata.get("last_time_index")
            if self_lti_col is not None or other_lti_col is not None:
                has_last_time_index.append(df.ww.name)

            combined_es.replace_dataframe(
                dataframe_name=df.ww.name,
                df=combined_df,
                recalculate_last_time_indexes=False,
                already_sorted=False,
            )

        if has_last_time_index:
            combined_es.add_last_time_indexes(updated_dataframes=has_last_time_index)

        combined_es.reset_data_description()

        return combined_es

    # ###########################################################################
    # #  Indexing methods  #####################################################
    # ###########################################################################
    def add_last_time_indexes(self, updated_dataframes=None):
        """
        Calculates the last time index values for each dataframe (the last time
        an instance or children of that instance were observed).  Used when
        calculating features using training windows. Adds the last time index as
        a series named _ft_last_time on the dataframe.

        Args:
            updated_dataframes (list[str]): List of dataframe names to update last_time_index for
                (will update all parents of those dataframes as well)
        """
        # Generate graph of dataframes to find leaf dataframes
        children = defaultdict(list)  # parent --> child mapping
        child_cols = defaultdict(dict)
        for r in self.relationships:
            children[r._parent_dataframe_name].append(r.child_dataframe)
            child_cols[r._parent_dataframe_name][r._child_dataframe_name] = (
                r.child_column
            )

        updated_dataframes = updated_dataframes or []
        if updated_dataframes:
            # find parents of updated_dataframes
            parent_queue = updated_dataframes[:]
            parents = set()
            while len(parent_queue):
                df_name = parent_queue.pop(0)
                if df_name in parents:
                    continue
                parents.add(df_name)

                for parent_name, _ in self.get_forward_dataframes(df_name):
                    parent_queue.append(parent_name)

            queue = [self[p] for p in parents]
            to_explore = parents
        else:
            to_explore = set(self.dataframe_dict.keys())
            queue = self.dataframes[:]

        explored = set()
        # Store the last time indexes for the entire entityset in a dictionary to update
        es_lti_dict = {}
        for df in self.dataframes:
            lti_col = df.ww.metadata.get("last_time_index")
            if lti_col is not None:
                lti_col = df[lti_col]
            es_lti_dict[df.ww.name] = lti_col

        for df in queue:
            es_lti_dict[df.ww.name] = None

        # We will explore children of dataframes on the queue,
        # which may not be in the to_explore set. Therefore,
        # we check whether all elements of to_explore are in
        # explored, rather than just comparing length
        while not to_explore.issubset(explored):
            dataframe = queue.pop(0)

            if es_lti_dict[dataframe.ww.name] is None:
                if dataframe.ww.time_index is not None:
                    lti = dataframe[dataframe.ww.time_index].copy()
                else:
                    lti = dataframe.ww[dataframe.ww.index].copy()
                    # Cannot have a category dtype with nans when calculating last time index
                    lti = lti.astype("object")
                    lti[:] = None

                es_lti_dict[dataframe.ww.name] = lti

            if dataframe.ww.name in children:
                child_dataframes = children[dataframe.ww.name]

                # if all children not explored, skip for now
                if not {df.ww.name for df in child_dataframes}.issubset(explored):
                    # Now there is a possibility that a child dataframe
                    # was not explicitly provided in updated_dataframes,
                    # and never made it onto the queue. If updated_dataframes
                    # is None then we just load all dataframes onto the queue
                    # so we don't need this logic
                    for df in child_dataframes:
                        if df.ww.name not in explored and df.ww.name not in [
                            q.ww.name for q in queue
                        ]:
                            # must also reset last time index here
                            es_lti_dict[df.ww.name] = None
                            queue.append(df)
                    queue.append(dataframe)
                    continue

                # update last time from all children
                for child_df in child_dataframes:
                    if es_lti_dict[child_df.ww.name] is None:
                        continue
                    link_col = child_cols[dataframe.ww.name][child_df.ww.name].name

                    lti_df = pd.DataFrame(
                        {
                            "last_time": es_lti_dict[child_df.ww.name],
                            dataframe.ww.index: child_df[link_col],
                        },
                    )

                    # sort by time and keep only the most recent
                    lti_df.sort_values(
                        ["last_time", dataframe.ww.index],
                        kind="mergesort",
                        inplace=True,
                    )

                    lti_df.drop_duplicates(
                        dataframe.ww.index,
                        keep="last",
                        inplace=True,
                    )

                    lti_df.set_index(dataframe.ww.index, inplace=True)
                    lti_df = lti_df.reindex(es_lti_dict[dataframe.ww.name].index)
                    lti_df["last_time_old"] = es_lti_dict[dataframe.ww.name]
                    if lti_df.empty:
                        # Pandas errors out if it tries to do fillna and then max on an empty dataframe
                        lti_df = pd.Series([], dtype="object")
                    else:
                        lti_df["last_time"] = lti_df["last_time"].astype(
                            "datetime64[ns]",
                        )
                        lti_df["last_time_old"] = lti_df["last_time_old"].astype(
                            "datetime64[ns]",
                        )
                        lti_df = lti_df.fillna(
                            pd.to_datetime("1800-01-01 00:00"),
                        ).max(axis=1)
                        lti_df = lti_df.replace(
                            pd.to_datetime("1800-01-01 00:00"),
                            pd.NaT,
                        )

                    es_lti_dict[dataframe.ww.name] = lti_df
                    es_lti_dict[dataframe.ww.name].name = "last_time"

            explored.add(dataframe.ww.name)

        # Store the last time index on the DataFrames
        dfs_to_update = {}
        for df in self.dataframes:
            lti = es_lti_dict[df.ww.name]
            if lti is not None:
                if self.time_type == "numeric":
                    if lti.dtype == "datetime64[ns]":
                        # Woodwork cannot convert from datetime to numeric
                        lti = lti.apply(lambda x: x.value)
                    lti = init_series(lti, logical_type="Double")
                else:
                    lti = init_series(lti, logical_type="Datetime")

                lti.name = LTI_COLUMN_NAME

                if LTI_COLUMN_NAME in df.columns:
                    if "last_time_index" in df.ww.semantic_tags[LTI_COLUMN_NAME]:
                        # Remove any previous last time index placed by featuretools
                        df.ww.pop(LTI_COLUMN_NAME)
                    else:
                        raise ValueError(
                            "Cannot add a last time index on DataFrame with an existing "
                            f"'{LTI_COLUMN_NAME}' column. Please rename '{LTI_COLUMN_NAME}'.",
                        )

                # Add the new column to the DataFrame
                df.ww[LTI_COLUMN_NAME] = lti
                if "last_time_index" not in df.ww.semantic_tags[LTI_COLUMN_NAME]:
                    df.ww.add_semantic_tags({LTI_COLUMN_NAME: "last_time_index"})
                df.ww.metadata["last_time_index"] = LTI_COLUMN_NAME

        for df in dfs_to_update.values():
            df.ww.add_semantic_tags({LTI_COLUMN_NAME: "last_time_index"})
            df.ww.metadata["last_time_index"] = LTI_COLUMN_NAME
            self.dataframe_dict[df.ww.name] = df

        self.reset_data_description()
        for df in self.dataframes:
            self._add_references_to_metadata(df)

    # ###########################################################################
    # #  Pickling ###############################################
    # ###########################################################################
    def __getstate__(self):
        return {
            **self.__dict__,
            WW_SCHEMA_KEY: {
                df_name: df.ww.schema for df_name, df in self.dataframe_dict.items()
            },
        }

    def __setstate__(self, state):
        ww_schemas = state.pop(WW_SCHEMA_KEY)
        for df_name, df in state.get("dataframe_dict", {}).items():
            if ww_schemas[df_name] is not None:
                df.ww.init(schema=ww_schemas[df_name], validate=False)

        self.__dict__.update(state)

    # ###########################################################################
    # #  Other ###############################################
    # ###########################################################################
    def add_interesting_values(
        self,
        max_values=5,
        verbose=False,
        dataframe_name=None,
        values=None,
    ):
        """Find or set interesting values for categorical columns, to be used to generate "where" clauses

        Args:
            max_values (int) : Maximum number of values per column to add.
            verbose (bool) : If True, print summary of interesting values found.
            dataframe_name (str) : The dataframe in the EntitySet for which to add interesting values.
                If not specified interesting values will be added for all dataframes.
            values (dict): A dictionary mapping column names to the interesting values to set
                for the column. If specified, a corresponding dataframe_name must also be provided.
                If not specified, interesting values will be set for all eligible columns. If values
                are specified, max_values and verbose parameters will be ignored.

        Returns:
            None

        """
        if dataframe_name is None and values is not None:
            raise ValueError("dataframe_name must be specified if values are provided")

        if dataframe_name is not None and values is not None:
            for column, vals in values.items():
                self[dataframe_name].ww.columns[column].metadata[
                    "interesting_values"
                ] = vals
            return

        if dataframe_name:
            dataframes = [self[dataframe_name]]
        else:
            dataframes = self.dataframes

        def add_value(df, col, val, verbose):
            if verbose:
                msg = "Column {}: Marking {} as an interesting value"
                logger.info(msg.format(col, val))
            interesting_vals = df.ww.columns[col].metadata.get("interesting_values", [])
            interesting_vals.append(val)
            df.ww.columns[col].metadata["interesting_values"] = interesting_vals

        for df in dataframes:
            value_counts = df.ww.value_counts(top_n=max(25, max_values), dropna=True)
            total_count = len(df)

            for col, counts in value_counts.items():
                if {"index", "foreign_key"}.intersection(df.ww.semantic_tags[col]):
                    continue

                for i in range(min(max_values, len(counts))):
                    # Categorical columns will include counts of 0 for all values
                    # in categories. Stop when we encounter a 0 count.
                    if counts[i]["count"] == 0:
                        break
                    if len(counts) < 25:
                        value = counts[i]["value"]
                        add_value(df, col, value, verbose)
                    else:
                        fraction = counts[i]["count"] / total_count
                        if fraction > 0.05 and fraction < 0.95:
                            value = counts[i]["value"]
                            add_value(df, col, value, verbose)
                        else:
                            break

        self.reset_data_description()

    def plot(self, to_file=None):
        """
        Create a UML diagram-ish graph of the EntitySet.

        Args:
            to_file (str, optional) : Path to where the plot should be saved.
                If set to None (as by default), the plot will not be saved.

        Returns:
            graphviz.Digraph : Graph object that can directly be displayed in
                Jupyter notebooks. Nodes of the graph correspond to the DataFrames
                in the EntitySet, showing the typing information for each column.

        Note:
            The typing information displayed for each column is based off of the Woodwork
            ColumnSchema for that column and is represented as ``LogicalType; semantic_tags``,
            but the standard semantic tags have been removed for brevity.
        """
        graphviz = check_graphviz()
        format_ = get_graphviz_format(graphviz=graphviz, to_file=to_file)

        # Initialize a new directed graph
        graph = graphviz.Digraph(
            self.id,
            format=format_,
            graph_attr={"splines": "ortho"},
        )

        # Draw dataframes
        for df in self.dataframes:
            column_typing_info = []
            for col_name, col_schema in df.ww.columns.items():
                col_string = col_name + " : " + str(col_schema.logical_type)

                tags = col_schema.semantic_tags - col_schema.logical_type.standard_tags
                if tags:
                    col_string += "; "
                    col_string += ", ".join(tags)
                column_typing_info.append(col_string)

            columns_string = "\\l".join(column_typing_info)
            nrows = df.shape[0]
            label = "{%s (%d row%s)|%s\\l}" % (
                df.ww.name,
                nrows,
                "s" * (nrows > 1),
                columns_string,
            )
            graph.node(df.ww.name, shape="record", label=label)

        # Draw relationships
        for rel in self.relationships:
            # Display the key only once if it is the same for both related dataframes
            if rel._parent_column_name == rel._child_column_name:
                label = rel._parent_column_name
            else:
                label = "%s -> %s" % (rel._parent_column_name, rel._child_column_name)

            graph.edge(
                rel._child_dataframe_name,
                rel._parent_dataframe_name,
                xlabel=label,
            )

        if to_file:
            save_graph(graph, to_file, format_)
        return graph

    def _handle_time(
        self,
        dataframe_name,
        df,
        time_last=None,
        training_window=None,
        include_cutoff_time=True,
    ):
        """
        Filter a dataframe for all instances before time_last.
        If the dataframe does not have a time index, return the original
        dataframe.
        """

        schema = self[dataframe_name].ww.schema
        if schema.time_index:
            df_empty = df.empty
            if time_last is not None and not df_empty:
                if include_cutoff_time:
                    df = df[df[schema.time_index] <= time_last]
                else:
                    df = df[df[schema.time_index] < time_last]
                if training_window is not None:
                    training_window = _check_timedelta(training_window)
                    if include_cutoff_time:
                        mask = df[schema.time_index] > time_last - training_window
                    else:
                        mask = df[schema.time_index] >= time_last - training_window
                    lti_col = schema.metadata.get("last_time_index")
                    if lti_col is not None:
                        if include_cutoff_time:
                            lti_mask = df[lti_col] > time_last - training_window
                        else:
                            lti_mask = df[lti_col] >= time_last - training_window
                        mask = mask | lti_mask
                    else:
                        warnings.warn(
                            "Using training_window but last_time_index is "
                            "not set for dataframe %s" % (dataframe_name),
                        )

                    df = df[mask]

        secondary_time_indexes = schema.metadata.get("secondary_time_index") or {}
        for secondary_time_index, columns in secondary_time_indexes.items():
            # should we use ignore time last here?
            if time_last is not None and not df.empty:
                mask = df[secondary_time_index] >= time_last
                df.loc[mask, columns] = np.nan

        return df

    def query_by_values(
        self,
        dataframe_name,
        instance_vals,
        column_name=None,
        columns=None,
        time_last=None,
        training_window=None,
        include_cutoff_time=True,
    ):
        """Query instances that have column with given value

        Args:
            dataframe_name (str): The name of the dataframe to query
            instance_vals (pd.DataFrame, pd.Series, list[str] or str) :
                Instance(s) to match.
            column_name (str) : Column to query on. If None, query on index.
            columns (list[str]) : Columns to return. Return all columns if None.
            time_last (pd.Timestamp) : Query data up to this time, including
                the time itself only if include_cutoff_time is True. Only
                applies if dataframe has a time index.
            training_window (Timedelta, optional):
                Window defining how much time before the cutoff time data
                can be used when calculating features. If None, all data before cutoff time is used.
            include_cutoff_time (bool):
                If True, data at cutoff time are included in calculating features

        Returns:
            pd.DataFrame : instances that match constraints with ids in order of underlying dataframe
        """
        dataframe = self[dataframe_name]
        if not column_name:
            column_name = dataframe.ww.index

        instance_vals = _vals_to_series(instance_vals, column_name)

        training_window = _check_timedelta(training_window)

        if training_window is not None:
            assert (
                training_window.has_no_observations()
            ), "Training window cannot be in observations"

        if instance_vals is None:
            df = dataframe.copy()

        elif isinstance(instance_vals, pd.Series) and instance_vals.empty:
            df = dataframe.head(0)

        else:
            df = dataframe[dataframe[column_name].isin(instance_vals)]
            df = df.set_index(dataframe.ww.index, drop=False)

            # ensure filtered df has same categories as original
            # workaround for issue below
            # github.com/pandas-dev/pandas/issues/22501#issuecomment-415982538
            #
            # Pandas claims that bug is fixed but it still shows up in some
            # cases.  More investigation needed.
            if dataframe.ww.columns[column_name].is_categorical:
                categories = pd.api.types.CategoricalDtype(
                    categories=dataframe[column_name].cat.categories,
                )
                df[column_name] = df[column_name].astype(categories)

        df = self._handle_time(
            dataframe_name=dataframe_name,
            df=df,
            time_last=time_last,
            training_window=training_window,
            include_cutoff_time=include_cutoff_time,
        )

        if columns is not None:
            df = df[columns]

        return df

    def replace_dataframe(
        self,
        dataframe_name,
        df,
        already_sorted=False,
        recalculate_last_time_indexes=True,
    ):
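        """Replace the internal dataframe of an EntitySet table, keeping Woodwork typing information the same.
        Optionally makes sure that data is sorted, that reference indexes to other dataframes are consistent,
        and that last_time_indexes are updated to reflect the new data. If an index was created for the original
        dataframe and is not present on the new dataframe, an index column of the same name will be added to the
        new dataframe.

        Args:
            dataframe_name (str): Name of the dataframe in the EntitySet to replace.
            df (DataFrame): New data. Must be the same type as the existing dataframe
                and contain the same columns.
            already_sorted (bool): Passed to Woodwork initialization. If True, the data
                is assumed to be sorted already. Defaults to False.
            recalculate_last_time_indexes (bool): If True, recalculates last time indexes
                when the original dataframe has a last time index. Defaults to True.
        """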
        if not isinstance(df, type(self[dataframe_name])):
            raise TypeError("Incorrect DataFrame type used")

        # If the original DataFrame has a last time index column and the new one doesn't,
        # remove the column and the reference to last time index from that dataframe
        last_time_index_column = self[dataframe_name].ww.metadata.get("last_time_index")
        if (
            last_time_index_column is not None
            and last_time_index_column not in df.columns
        ):
            self[dataframe_name].ww.pop(last_time_index_column)
            del self[dataframe_name].ww.metadata["last_time_index"]

        # If the original DataFrame had an index created via make_index,
        # we may need to remake the index if it's not in the new DataFrame
        created_index = self[dataframe_name].ww.metadata.get("created_index")
        if created_index is not None and created_index not in df.columns:
            df = _create_index(df, created_index)

        old_column_names = list(self[dataframe_name].columns)
        if len(df.columns) != len(old_column_names):
            raise ValueError(
                "New dataframe contains {} columns, expecting {}".format(
                    len(df.columns),
                    len(old_column_names),
                ),
            )
        for col_name in old_column_names:
            if col_name not in df.columns:
                raise ValueError(
                    "New dataframe is missing new {} column".format(col_name),
                )

        if df.ww.schema is not None:
            warnings.warn(
                "Woodwork typing information on new dataframe will be replaced "
                f"with existing typing information from {dataframe_name}",
            )

        df.ww.init(
            schema=self[dataframe_name].ww._schema,
            already_sorted=already_sorted,
        )
        # Make sure column ordering matches original ordering
        df = df.ww[old_column_names]

        df = self._normalize_values(df)

        self.dataframe_dict[dataframe_name] = df

        if self[dataframe_name].ww.time_index is not None:
            self._check_uniform_time_index(self[dataframe_name])

        df_metadata = self[dataframe_name].ww.metadata
        self.set_secondary_time_index(
            dataframe_name,
            df_metadata.get("secondary_time_index"),
        )
        if recalculate_last_time_indexes and last_time_index_column is not None:
            self.add_last_time_indexes(updated_dataframes=[dataframe_name])
        self.reset_data_description()
        self._add_references_to_metadata(df)

    def _check_time_indexes(self):
        for dataframe in self.dataframe_dict.values():
            self._check_uniform_time_index(dataframe)
            self._check_secondary_time_index(dataframe)

    def _check_secondary_time_index(self, dataframe, secondary_time_index=None):
        secondary_time_index = secondary_time_index or dataframe.ww.metadata.get(
            "secondary_time_index",
            {},
        )

        if secondary_time_index and dataframe.ww.time_index is None:
            raise ValueError(
                "Cannot set secondary time index on a DataFrame that has no primary time index.",
            )

        for time_index, columns in secondary_time_index.items():
            self._check_uniform_time_index(dataframe, column_name=time_index)
            if time_index not in columns:
                columns.append(time_index)

    def _check_uniform_time_index(self, dataframe, column_name=None):
        column_name = column_name or dataframe.ww.time_index
        if column_name is None:
            return

        time_type = self._get_time_type(dataframe, column_name)
        if self.time_type is None:
            self.time_type = time_type
        elif self.time_type != time_type:
            info = "%s time index is %s type which differs from other entityset time indexes"
            raise TypeError(info % (dataframe.ww.name, time_type))

    def _get_time_type(self, dataframe, column_name=None):
        column_name = column_name or dataframe.ww.time_index

        column_schema = dataframe.ww.columns[column_name]

        time_type = None
        if column_schema.is_numeric:
            time_type = "numeric"
        elif column_schema.is_datetime:
            time_type = Datetime

        if time_type is None:
            info = "%s time index not recognized as numeric or datetime"
            raise TypeError(info % dataframe.ww.name)
        return time_type

    def _add_references_to_metadata(self, dataframe):
        dataframe.ww.metadata.update(entityset_id=self.id)
        for column in dataframe.columns:
            metadata = dataframe.ww._schema.columns[column].metadata
            metadata.update(dataframe_name=dataframe.ww.name)
            metadata.update(entityset_id=self.id)
        _ES_REF[self.id] = self

    def _normalize_values(self, dataframe):
        def replace(x):
            if not isinstance(x, (list, tuple, np.ndarray)) and pd.isna(x):
                return (np.nan, np.nan)
            else:
                return x

        for column, logical_type in dataframe.ww.logical_types.items():
            if isinstance(logical_type, LatLong):
                dataframe[column] = dataframe[column].apply(replace)
        return dataframe


def _vals_to_series(instance_vals, column_id):
    """
    instance_vals may be a pd.DataFrame, a pd.Series, a list, a single
    value, or None. This function always returns a Series or None.
    """
    if instance_vals is None:
        return None

    # If this is a single value, make it a list
    if not hasattr(instance_vals, "__iter__"):
        instance_vals = [instance_vals]

    # convert iterable to pd.Series
    if isinstance(instance_vals, pd.DataFrame):
        out_vals = instance_vals[column_id]
    else:
        out_vals = pd.Series(instance_vals)

    # no duplicates or NaN values
    out_vals = out_vals.drop_duplicates().dropna()

    # want index to have no name for the merge in query_by_values
    out_vals.index.name = None

    return out_vals


def _get_or_create_index(index, make_index, df):
    """Handles index creation logic based on user input"""
    index_was_created = False

    if index is None:
        # Case 1: user wanted to make index but did not specify column name
        assert not make_index, "Must specify an index name if make_index is True"
        # Case 2: make_index not specified but no index supplied, use first column
        warnings.warn(
            (
                "Using first column as index. "
                "To change this, specify the index parameter"
            ),
        )
        index = df.columns[0]
    elif make_index and index in df.columns:
        # Case 3: user wanted to make index but column already exists
        raise RuntimeError(
            f"Cannot make index: column with name {index} already present",
        )
    elif index not in df.columns:
        if not make_index:
            # Case 4: user names an index that is not in df and does not
            # specify make_index. Make a new index column and warn
            warnings.warn(
                "index {} not found in dataframe, creating new "
                "integer column".format(index),
            )
        # Case 5: make_index with no errors or warnings
        # (Case 4 also uses this code path)
        df = _create_index(df, index)
        index_was_created = True
    # Case 6: user specified index, which is already in df. No action needed.
    return index_was_created, index, df


def _create_index(df, index):
    df.insert(0, index, range(len(df)))
    return df
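

# The six cases handled by _get_or_create_index can be summarized with a small
# standalone sketch. This is an illustration only, not part of the library:
# `resolve_index` is a hypothetical helper that works on a plain list of column
# names, omits the warnings, and raises ValueError where the real function
# uses an assert.
```python
def resolve_index(index, make_index, columns):
    """Return (index_was_created, index_name) for the given column names."""
    if index is None:
        # Case 1: make_index requested without a name is an error.
        if make_index:
            raise ValueError("Must specify an index name if make_index is True")
        # Case 2: fall back to the first column.
        return False, columns[0]
    if make_index and index in columns:
        # Case 3: cannot create a column that already exists.
        raise RuntimeError(f"Cannot make index: column with name {index} already present")
    if index not in columns:
        # Cases 4 and 5: a new integer index column would be created.
        return True, index
    # Case 6: the named column already exists and is used as is.
    return False, index


columns = ["id", "amount"]
print(resolve_index(None, False, columns))
print(resolve_index("row_id", True, columns))
print(resolve_index("id", False, columns))
```
# Only the "created" flag and the chosen name are modeled here; the real
# function also inserts the new column into the dataframe.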


================================================
FILE: featuretools/entityset/relationship.py
================================================
class Relationship(object):
    """Class to represent a relationship between dataframes

    See Also:
        :class:`.EntitySet`
    """

    def __init__(
        self,
        entityset,
        parent_dataframe_name,
        parent_column_name,
        child_dataframe_name,
        child_column_name,
    ):
        """Create a relationship

        Args:
            entityset (:class:`.EntitySet`): EntitySet to which the relationship belongs
            parent_dataframe_name (str): Name of the parent dataframe in the EntitySet
            parent_column_name (str): Name of the parent column
            child_dataframe_name (str): Name of the child dataframe in the EntitySet
            child_column_name (str): Name of the child column
        """

        self.entityset = entityset
        self._parent_dataframe_name = parent_dataframe_name
        self._child_dataframe_name = child_dataframe_name
        self._parent_column_name = parent_column_name
        self._child_column_name = child_column_name

        if (
            self.parent_dataframe.ww.index is not None
            and self._parent_column_name != self.parent_dataframe.ww.index
        ):
            raise AttributeError(
                f"Parent column '{self._parent_column_name}' is not the index of "
                f"dataframe {self._parent_dataframe_name}",
            )

    @classmethod
    def from_dictionary(cls, arguments, es):
        parent_dataframe = arguments["parent_dataframe_name"]
        child_dataframe = arguments["child_dataframe_name"]
        parent_column = arguments["parent_column_name"]
        child_column = arguments["child_column_name"]
        return cls(es, parent_dataframe, parent_column, child_dataframe, child_column)

    def __repr__(self):
        ret = "<Relationship: %s.%s -> %s.%s>" % (
            self._child_dataframe_name,
            self._child_column_name,
            self._parent_dataframe_name,
            self._parent_column_name,
        )

        return ret

    def __eq__(self, other):
        if not isinstance(other, self.__class__):
            return False

        return (
            self._parent_dataframe_name == other._parent_dataframe_name
            and self._child_dataframe_name == other._child_dataframe_name
            and self._parent_column_name == other._parent_column_name
            and self._child_column_name == other._child_column_name
        )

    def __hash__(self):
        return hash(
            (
                self._parent_dataframe_name,
                self._child_dataframe_name,
                self._parent_column_name,
                self._child_column_name,
            ),
        )

    @property
    def parent_dataframe(self):
        """Parent dataframe object"""
        return self.entityset[self._parent_dataframe_name]

    @property
    def child_dataframe(self):
        """Child dataframe object"""
        return self.entityset[self._child_dataframe_name]

    @property
    def parent_column(self):
        """Column in parent dataframe"""
        return self.parent_dataframe.ww[self._parent_column_name]

    @property
    def child_column(self):
        """Column in child dataframe"""
        return self.child_dataframe.ww[self._child_column_name]

    @property
    def parent_name(self):
        """The name of the parent, relative to the child."""
        if self._is_unique():
            return self._parent_dataframe_name
        else:
            return "%s[%s]" % (self._parent_dataframe_name, self._child_column_name)

    @property
    def child_name(self):
        """The name of the child, relative to the parent."""
        if self._is_unique():
            return self._child_dataframe_name
        else:
            return "%s[%s]" % (self._child_dataframe_name, self._child_column_name)

    def to_dictionary(self):
        return {
            "parent_dataframe_name": self._parent_dataframe_name,
            "child_dataframe_name": self._child_dataframe_name,
            "parent_column_name": self._parent_column_name,
            "child_column_name": self._child_column_name,
        }

    def _is_unique(self):
        """Return True if this is the only relationship between the child and parent dataframes."""
        es = self.entityset
        relationships = es.get_forward_relationships(self._child_dataframe_name)
        n = len(
            [
                r
                for r in relationships
                if r._parent_dataframe_name == self._parent_dataframe_name
            ],
        )

        assert n > 0, "This relationship is missing from the entityset"

        return n == 1


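class RelationshipPath(object):
    """An ordered sequence of ``(is_forward, relationship)`` pairs describing
    a path between dataframes in an EntitySet."""

    def __init__(self, relationships_with_direction):
        self._relationships_with_direction = relationships_with_direction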

    @property
    def name(self):
        relationship_names = [
            _direction_name(is_forward, r)
            for is_forward, r in self._relationships_with_direction
        ]

        return ".".join(relationship_names)

    def dataframes(self):
        if self:
            # Yield first dataframe.
            is_forward, relationship = self[0]
            if is_forward:
                yield relationship._child_dataframe_name
            else:
                yield relationship._parent_dataframe_name

        # Yield the dataframe pointed to by each relationship.
        for is_forward, relationship in self:
            if is_forward:
                yield relationship._parent_dataframe_name
            else:
                yield relationship._child_dataframe_name

    def __add__(self, other):
        return RelationshipPath(
            self._relationships_with_direction + other._relationships_with_direction,
        )

    def __getitem__(self, index):
        return self._relationships_with_direction[index]

    def __iter__(self):
        for is_forward, relationship in self._relationships_with_direction:
            yield is_forward, relationship

    def __len__(self):
        return len(self._relationships_with_direction)

    def __eq__(self, other):
        return (
            isinstance(other, RelationshipPath)
            and self._relationships_with_direction
            == other._relationships_with_direction
        )

    def __ne__(self, other):
        return not self == other

    def __repr__(self):
        if self._relationships_with_direction:
            path = "%s.%s" % (next(self.dataframes()), self.name)
        else:
            path = "[]"
        return "<RelationshipPath %s>" % path


def _direction_name(is_forward, relationship):
    if is_forward:
        return relationship.parent_name
    else:
        return relationship.child_name
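

# A minimal, standalone sketch of how RelationshipPath.dataframes walks a path.
# It is an illustration only: `Rel` and `path_dataframes` are hypothetical
# stand-ins that keep just the child and parent dataframe names, so the
# traversal order can be seen without building an EntitySet.
```python
from collections import namedtuple

# Hypothetical stand-in for Relationship: only the two dataframe names matter here.
Rel = namedtuple("Rel", ["child", "parent"])


def path_dataframes(steps):
    """Yield the dataframe names visited by a list of (is_forward, Rel) steps."""
    if steps:
        is_forward, rel = steps[0]
        # A forward step starts at the child; a backward step starts at the parent.
        yield rel.child if is_forward else rel.parent
    for is_forward, rel in steps:
        # Each step then moves to the opposite end of its relationship.
        yield rel.parent if is_forward else rel.child


steps = [
    (True, Rel(child="log", parent="sessions")),
    (True, Rel(child="sessions", parent="customers")),
]
visited = list(path_dataframes(steps))
print(visited)
```
# A path of n steps yields n + 1 names, and an empty path yields nothing.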


================================================
FILE: featuretools/entityset/serialize.py
================================================
import datetime
import json
import os
import tarfile
import tempfile

from woodwork.serializers.serializer_base import typing_info_to_dict

from featuretools.utils.s3_utils import get_transport_params, use_smartopen_es
from featuretools.utils.wrangle import _is_s3, _is_url
from featuretools.version import ENTITYSET_SCHEMA_VERSION

FORMATS = ["csv", "pickle", "parquet"]


def entityset_to_description(entityset, format=None):
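    """Serialize entityset to data description.

    Args:
        entityset (EntitySet) : Instance of :class:`.EntitySet`.
        format (str, optional) : Serialization format recorded in the description
            under the ``format`` key. Defaults to None.

    Returns:
        description (dict) : Description of :class:`.EntitySet`.
    """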

    dataframes = {
        dataframe.ww.name: typing_info_to_dict(dataframe)
        for dataframe in entityset.dataframes
    }
    relationships = [
        relationship.to_dictionary() for relationship in entityset.relationships
    ]

    data_description = {
        "schema_version": ENTITYSET_SCHEMA_VERSION,
        "id": entityset.id,
        "dataframes": dataframes,
        "relationships": relationships,
        "format": format,
    }
    return data_description


def write_data_description(entityset, path, profile_name=None, **kwargs):
    """Serialize entityset to data description and write to disk or S3 path.

    Args:
        entityset (EntitySet) : Instance of :class:`.EntitySet`.
        path (str) : Location on disk or S3 path to write `data_description.json` and dataframe data.
        profile_name (str, bool): The AWS profile specified to write to S3. Will default to None and search for AWS credentials.
            Set to False to use an anonymous profile.
        kwargs (keywords) : Additional keyword arguments passed to the underlying serialization method.
    """
    if _is_s3(path):
        with tempfile.TemporaryDirectory() as tmpdir:
            os.makedirs(os.path.join(tmpdir, "data"))
            dump_data_description(entityset, tmpdir, **kwargs)
            file_path = create_archive(tmpdir)

            transport_params = get_transport_params(profile_name)
            use_smartopen_es(
                file_path,
                path,
                read=False,
                transport_params=transport_params,
            )
    elif _is_url(path):
        raise ValueError("Writing to URLs is not supported")
    else:
        path = os.path.abspath(path)
        os.makedirs(os.path.join(path, "data"), exist_ok=True)
        dump_data_description(entityset, path, **kwargs)


def dump_data_description(entityset, path, **kwargs):
    """Write each dataframe's data and the `data_description.json` file under path."""
    format = kwargs.get("format")
    description = entityset_to_description(entityset, format)
    for df in entityset.dataframes:
        data_path = os.path.join(path, "data", df.ww.name)
        os.makedirs(os.path.join(data_path, "data"), exist_ok=True)
        df.ww.to_disk(data_path, **kwargs)
    # Use a separate name for the path so the file handle does not shadow it
    description_path = os.path.join(path, "data_description.json")
    with open(description_path, "w") as file:
        json.dump(description, file)


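def create_archive(tmpdir):
    """Bundle `data_description.json` and the `data` directory in tmpdir into a tar file."""
    file_name = "es-{date:%Y-%m-%d_%H%M%S}.tar".format(date=datetime.datetime.now())
    file_path = os.path.join(tmpdir, file_name)
    # The context manager closes the archive even if adding a member fails
    with tarfile.open(str(file_path), "w") as tar:
        tar.add(
            os.path.join(str(tmpdir), "data_description.json"),
            arcname="/data_description.json",
        )
        tar.add(os.path.join(str(tmpdir), "data"), arcname="/data")
    return file_path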
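

# A standalone sketch of the archive step used for S3 writes, relying only on
# the standard library. It is an illustration under assumptions:
# `archive_description`, the fixed file name, and the sample file contents are
# hypothetical, and no upload is performed.
```python
import json
import os
import tarfile
import tempfile


def archive_description(tmpdir, file_name="es-example.tar"):
    """Bundle data_description.json and the data/ directory into one tar file."""
    file_path = os.path.join(tmpdir, file_name)
    with tarfile.open(file_path, "w") as tar:
        tar.add(
            os.path.join(tmpdir, "data_description.json"),
            arcname="data_description.json",
        )
        # Adding a directory also adds everything inside it.
        tar.add(os.path.join(tmpdir, "data"), arcname="data")
    return file_path


with tempfile.TemporaryDirectory() as tmpdir:
    # Lay out the same structure that is written before archiving.
    os.makedirs(os.path.join(tmpdir, "data"))
    with open(os.path.join(tmpdir, "data_description.json"), "w") as f:
        json.dump({"id": "example"}, f)
    with open(os.path.join(tmpdir, "data", "customers.csv"), "w") as f:
        f.write("id\n1\n")
    with tarfile.open(archive_description(tmpdir)) as tar:
        archived_names = tar.getnames()

print(archived_names)
```
# The archive is written next to the data rather than inside data/, so it is
# never added to itself.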


================================================
FILE: featuretools/entityset/timedelta.py
================================================
import pandas as pd
from dateutil.relativedelta import relativedelta


class Timedelta(object):
    """Represents differences in time.

    Timedeltas can be defined in multiple units. Supported units:

    - "ms" : milliseconds
    - "s" : seconds
    - "m" : minutes
    - "h" : hours
    - "d" : days
    - "w" : weeks
    - "mo" : months
    - "Y" : years
    - "o"/"observations" : number of individual events

    Timedeltas can also be defined in terms of observations. In this case, the
    Timedelta represents the period spanned by `value` observations.

    For observation timedeltas:

    >>> three_observations_log = Timedelta(3, "observations")
    >>> three_observations_log.get_name()
    '3 Observations'
    """

    _Observations = "o"

    # units for absolute times
    _absolute_units = ["ms", "s", "h", "m", "d", "w"]
    _relative_units = ["mo", "Y"]

    _readable_units = {
        "ms": "Milliseconds",
        "s": "Seconds",
        "h": "Hours",
        "m": "Minutes",
        "d": "Days",
        "o": "Observations",
        "w": "Weeks",
        "Y": "Years",
        "mo": "Months",
    }

    _readable_to_unit = {v.lower(): k for k, v in _readable_units.items()}

    def __init__(self, value, unit=None, delta_obj=None):
        """
        Args:
            value (float, str, dict) : Value of timedelta, string providing
                both unit and value, or a dictionary of units and times.
            unit (str) : Unit of time delta.
            delta_obj (pd.Timedelta or pd.DateOffset) : A time object used
                internally to do time operations. If None is provided, one will
                be created using the provided value and unit.
        """
        self.check_value(value, unit)
        self.times = self.fix_units()

        if delta_obj is not None:
            self.delta_obj = delta_obj
        else:
            self.delta_obj = self.get_unit_type()

    @classmethod
    def from_dictionary(cls, dictionary):
        dict_units = dictionary["unit"]
        dict_values = dictionary["value"]
        if isinstance(dict_units, str) and isinstance(dict_values, (int, float)):
            return cls({dict_units: dict_values})
        else:
            all_units = dict()
            for i in range(len(dict_units)):
                all_units[dict_units[i]] = dict_values[i]
            return cls(all_units)

    @classmethod
    def make_singular(cls, s):
        if len(s) > 1 and s.endswith("s"):
            return s[:-1]
        return s

    @classmethod
    def _check_unit_plural(cls, s):
        if len(s) > 2 and not s.endswith("s"):
            return (s + "s").lower()
        elif len(s) > 1:
            return s.lower()
        return s

    def get_value(self, unit=None):
        if unit is not None:
            return self.times[unit]
        elif len(self.times.values()) == 1:
            return list(self.times.values())[0]
        else:
            return self.times

    def get_units(self):
        return list(self.times.keys())

    def get_unit_type(self):
        all_units = self.get_units()
        if self._Observations in all_units:
            return None
        elif self.is_absolute() and self.has_multiple_units() is False:
            return pd.Timedelta(self.times[all_units[0]], all_units[0])
        else:
            readable_times = self.lower_readable_times()
            return relativedelta(**readable_times)

    def check_value(self, value, unit):
        if isinstance(value, str):
            from featuretools.utils.wrangle import _check_timedelta

            td = _check_timedelta(value)
            self.times = td.times
        elif isinstance(value, dict):
            self.times = value
        else:
            self.times = {unit: value}

    def fix_units(self):
        fixed_units = dict()
        for unit, value in self.times.items():
            unit = self._check_unit_plural(unit)
            if unit in self._readable_to_unit:
                unit = self._readable_to_unit[unit]
            fixed_units[unit] = value
        return fixed_units

    def lower_readable_times(self):
        readable_times = dict()
        for unit, value in self.times.items():
            readable_unit = self._readable_units[unit].lower()
            readable_times[readable_unit] = value
        return readable_times

    def get_name(self):
        all_units = self.get_units()
        if self.has_multiple_units() is False:
            return "{} {}".format(
                self.times[all_units[0]],
                self._readable_units[all_units[0]],
            )
        final_str = ""
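        for unit, value in self.times.items():
            readable_unit = self._readable_units[unit]
            if value == 1:
                # Singularize the readable name (e.g. "Hours" -> "Hour"), not the
                # unit key, which would turn "ms" or "mo" into "m" (minutes)
                readable_unit = self.make_singular(readable_unit)
            final_str += "{} {} ".format(value, readable_unit)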
        return final_str[:-1]

    def get_arguments(self):
        units = list()
        values = list()
        for unit, value in self.times.items():
            units.append(unit)
            values.append(value)
        if len(units) == 1:
            return {"unit": units[0], "value": values[0]}
        else:
            return {"unit": units, "value": values}

    def is_absolute(self):
        for unit in self.get_units():
            if unit not in self._absolute_units:
                return False
        return True

    def has_no_observations(self):
        for unit in self.get_units():
            # Compare for equality; `in` would do a substring test on the string "o"
            if unit == self._Observations:
                return False
        return True

    def has_multiple_units(self):
        return len(self.get_units()) > 1

    def __eq__(self, other):
        if not isinstance(other, Timedelta):
            return False

        return self.times == other.times

    def __neg__(self):
        """Negate the timedelta"""
        new_times = dict()
        for unit, value in self.times.items():
            new_times[unit] = -value
        if self.delta_obj is not None:
            return Timedelta(new_times, delta_obj=-self.delta_obj)
        else:
            return Timedelta(new_times)

    def __radd__(self, time):
        """Add the Timedelta to a timestamp value"""
        if self._Observations not in self.get_units():
            return time + self.delta_obj
        else:
            raise Exception("Invalid unit")

    def __rsub__(self, time):
        """Subtract the Timedelta from a timestamp value"""
        if self._Observations not in self.get_units():
            return time - self.delta_obj
        else:
            raise Exception("Invalid unit")


================================================
FILE: featuretools/exceptions.py
================================================
class UnknownFeature(Exception):
    def __init__(self, *args, **kwargs):
        Exception.__init__(self, *args, **kwargs)


class UnusedPrimitiveWarning(UserWarning):
    pass


================================================
FILE: featuretools/feature_base/__init__.py
================================================
# flake8: noqa
from featuretools.feature_base.api import *


================================================
FILE: featuretools/feature_base/api.py
================================================
# flake8: noqa
from featuretools.feature_base.feature_base import (
    AggregationFeature,
    DirectFeature,
    Feature,
    FeatureBase,
    FeatureOutputSlice,
    GroupByTransformFeature,
    IdentityFeature,
    TransformFeature,
)
from featuretools.feature_base.feature_descriptions import describe_feature
from featuretools.feature_base.feature_visualizer import graph_feature
from featuretools.feature_base.features_deserializer import load_features
from featuretools.feature_base.features_serializer import save_features


================================================
FILE: featuretools/feature_base/cache.py
================================================
"""
cache.py

Custom caching class, currently used for FeatureBase
"""

# needed for defaultdict annotation if < python 3.9
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List, Optional, Union


class CacheType(Enum):
    """Enumerates the supported cache types"""

    DEPENDENCY = 1
    DEPTH = 2


@dataclass()
class FeatureCache:
    """Provides caching for the defined types"""

    enabled: bool = False
    cache: defaultdict[dict] = field(default_factory=lambda: defaultdict(dict))

    def get(
        self,
        cache_type: CacheType,
        hashkey: int,
    ) -> Optional[Union[List[Any], Any]]:
        """Gets the cache entry, if enabled and defined

        Args:
            cache_type (CacheType): type of cache
            hashkey (int): hash key

        Returns:
            Optional[Union[List[Any], Any]]: payload assigned to the hashkey
        """
        if not self.enabled or cache_type not in self.cache:
            return None
        return self.cache[cache_type].get(hashkey, None)

    def add(self, cache_type: CacheType, hashkey: int, payload: Any):
        """Adds an entry to the cache, if enabled

        Args:
            cache_type (CacheType): type of cache
            hashkey (int): hash key
            payload (Any): payload to assign
        """
        if self.enabled:
            self.cache[cache_type][hashkey] = payload

    def clear_all(self):
        """Clears the cache collections"""
        self.cache.clear()


feature_cache = FeatureCache()


================================================
FILE: featuretools/feature_base/feature_base.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools import primitives
from featuretools.entityset.relationship import Relationship, RelationshipPath
from featuretools.entityset.timedelta import Timedelta
from featuretools.feature_base.utils import is_valid_input
from featuretools.primitives.base import (
    AggregationPrimitive,
    PrimitiveBase,
    TransformPrimitive,
)
from featuretools.utils.wrangle import _check_time_against_column, _check_timedelta

_ES_REF = {}


class FeatureBase(object):
    def __init__(
        self,
        dataframe,
        base_features,
        relationship_path,
        primitive,
        name=None,
        names=None,
    ):
        """Base class for all features

        Args:
            dataframe (DataFrame): dataframe for calculating this feature
            base_features (list[FeatureBase]): list of base features for primitive
            relationship_path (RelationshipPath): path from this dataframe to the
                dataframe of the base features.
            primitive (:class:`.PrimitiveBase`): primitive to calculate. If not initialized
                when passed, gets initialized with no arguments.
            name (str, optional): name of the feature. If not provided, a name is generated.
            names (list[str], optional): names of the output columns for multi-output
                features. If not provided, names are generated.
        """
        assert all(
            isinstance(f, FeatureBase) for f in base_features
        ), "All base features must be features"

        self.dataframe_name = dataframe.ww.name
        self.entityset = _ES_REF[dataframe.ww.metadata["entityset_id"]]

        self.base_features = base_features

        # initialize if not already initialized
        if not isinstance(primitive, PrimitiveBase):
            primitive = primitive()

        self.primitive = primitive

        self.relationship_path = relationship_path

        self._name = name

        self._names = names

        assert (
            self._check_input_types()
        ), "Provided inputs don't match input type requirements"

    def __getitem__(self, key):
        assert (
            self.number_output_features > 1
        ), "can only access slice of multi-output feature"
        assert (
            self.number_output_features > key
        ), "index is higher than the number of outputs"
        return FeatureOutputSlice(self, key)

    @classmethod
    def from_dictionary(cls, arguments, entityset, dependencies, primitive):
        raise NotImplementedError("Must define from_dictionary on FeatureBase subclass")

    def rename(self, name):
        """Rename Feature, returns copy. Will reset any custom feature column names
        to their default value."""
        feature_copy = self.copy()
        feature_copy._name = name
        feature_copy._names = None
        return feature_copy

    def copy(self):
        raise NotImplementedError("Must define copy on FeatureBase subclass")

    def get_name(self):
        if not self._name:
            self._name = self.generate_name()
        return self._name

    def get_feature_names(self):
        if not self._names:
            if self.number_output_features == 1:
                self._names = [self.get_name()]
            else:
                self._names = self.generate_names()
                if self.get_name() != self.generate_name():
                    self._names = [
                        self.get_name() + "[{}]".format(i)
                        for i in range(len(self._names))
                    ]
        return self._names

    def set_feature_names(self, names):
        """Set new values for the feature column names, overriding the default values.
        Number of names provided must match the number of output columns defined for
        the feature, and all provided names should be unique. Only works for features
        that have more than one output column. Use ``Feature.rename`` to change the column
        name for single output features.

        Args:
            names (list[str]): List of names to use for the output feature columns. Provided
                names must be unique.
        """
        if self.number_output_features == 1:
            raise ValueError(
                "The set_feature_names can only be used on features that have more than one output column.",
            )

        num_new_names = len(names)
        if self.number_output_features != num_new_names:
            raise ValueError(
                "Number of names provided must match the number of output features:"
                f" {num_new_names} name(s) provided, {self.number_output_features} expected.",
            )

        if len(set(names)) != num_new_names:
            raise ValueError("Provided output feature names must be unique.")

        self._names = names

    def get_function(self, **kwargs):
        return self.primitive.get_function(**kwargs)

    def get_dependencies(self, deep=False, ignored=None, copy=True):
        """Returns features that are used to calculate this feature

        .. note::

            If you only want the features that make up the input to the feature
            function, use the base_features attribute instead.

        """
        deps = []

        deps.extend(self.base_features)

        if hasattr(self, "where") and self.where:
            deps += [self.where]

        if ignored is None:
            ignored = set()
        deps = [d for d in deps if d.unique_name() not in ignored]

        if deep:
            for dep in deps[:]:  # copy so we don't modify list we iterate over
                deep_deps = dep.get_dependencies(deep, ignored)
                deps += deep_deps

        return deps

    def get_depth(self, stop_at=None):
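        """Returns the depth of this feature: one more than the depth of its
        deepest dependency. Features listed in ``stop_at`` count as depth 0."""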
        max_depth = 0
        stop_at_set = set()
        if stop_at is not None:
            stop_at_set = set([i.unique_name() for i in stop_at])
            if self.unique_name() in stop_at_set:
                return 0
        for dep in self.get_dependencies(deep=True, ignored=stop_at_set):
            max_depth = max(dep.get_depth(stop_at=stop_at), max_depth)
        return max_depth + 1

    def _check_input_types(self):
        if len(self.base_features) == 0:
            return True

        input_types = self.primitive.input_types
        if input_types is not None:
            if not isinstance(input_types[0], list):
                input_types = [input_types]

            for type_set in input_types:
                zipped = list(zip(type_set, self.base_features))
                if all([is_valid_input(f.column_schema, t) for t, f in zipped]):
                    return True
        else:
            return True
        return False

    @property
    def dataframe(self):
        """Dataframe this feature belongs to"""
        return self.entityset[self.dataframe_name]

    @property
    def number_output_features(self):
        return self.primitive.number_output_features

    def __repr__(self):
        return "<Feature: %s>" % (self.get_name())

    def hash(self):
        return hash(self.get_name() + self.dataframe_name)

    def __hash__(self):
        return self.hash()

    @property
    def column_schema(self):
        feature = self
        column_schema = self.primitive.return_type

        while column_schema is None:
            # get column_schema of first base feature
            base_feature = feature.base_features[0]
            column_schema = base_feature.column_schema

            # only the original time index should exist
            # so make this feature's return type just a Datetime
            if "time_index" in column_schema.semantic_tags:
                column_schema = ColumnSchema(
                    logical_type=column_schema.logical_type,
                    semantic_tags=column_schema.semantic_tags - {"time_index"},
                )
            elif "index" in column_schema.semantic_tags:
                column_schema = ColumnSchema(
                    logical_type=column_schema.logical_type,
                    semantic_tags=column_schema.semantic_tags - {"index"},
                )
                # Add back the standard "numeric" or "category" tag so the schema
                # is still recognized as a valid return type
                if column_schema.is_numeric:
                    column_schema.semantic_tags.add("numeric")
                if column_schema.is_categorical:
                    column_schema.semantic_tags.add("category")

            # direct features should keep the foreign key tag, but all other features should get converted
            if (
                not isinstance(feature, DirectFeature)
                and "foreign_key" in column_schema.semantic_tags
            ):
                column_schema = ColumnSchema(
                    logical_type=column_schema.logical_type,
                    semantic_tags=column_schema.semantic_tags - {"foreign_key"},
                )

            feature = base_feature

        return column_schema

    @property
    def default_value(self):
        return self.primitive.default_value

    def get_arguments(self):
        raise NotImplementedError("Must define get_arguments on FeatureBase subclass")

    def to_dictionary(self):
        return {
            "type": type(self).__name__,
            "dependencies": [dep.unique_name() for dep in self.get_dependencies()],
            "arguments": self.get_arguments(),
        }

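    def _handle_binary_comparison(self, other, Primitive, PrimitiveScalar):
        # A feature operand uses the two-input primitive; any other operand is
        # treated as a scalar and passed to the scalar primitive.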
        if isinstance(other, FeatureBase):
            return Feature([self, other], primitive=Primitive)

        return Feature([self], primitive=PrimitiveScalar(other))

    def __eq__(self, other):
        """Compares to other by equality"""
        return self._handle_binary_comparison(
            other,
            primitives.Equal,
            primitives.EqualScalar,
        )

    def __ne__(self, other):
        """Compares to other by non-equality"""
        return self._handle_binary_comparison(
            other,
            primitives.NotEqual,
            primitives.NotEqualScalar,
        )

    def __gt__(self, other):
        """Compares if greater than other"""
        return self._handle_binary_comparison(
            other,
            primitives.GreaterThan,
            primitives.GreaterThanScalar,
        )

    def __ge__(self, other):
        """Compares if greater than or equal to other"""
        return self._handle_binary_comparison(
            other,
            primitives.GreaterThanEqualTo,
            primitives.GreaterThanEqualToScalar,
        )

    def __lt__(self, other):
        """Compares if less than other"""
        return self._handle_binary_comparison(
            other,
            primitives.LessThan,
            primitives.LessThanScalar,
        )

    def __le__(self, other):
        """Compares if less than or equal to other"""
        return self._handle_binary_comparison(
            other,
            primitives.LessThanEqualTo,
            primitives.LessThanEqualToScalar,
        )

    def __add__(self, other):
        """Add other"""
        return self._handle_binary_comparison(
            other,
            primitives.AddNumeric,
            primitives.AddNumericScalar,
        )

    def __radd__(self, other):
        return self.__add__(other)

    def __sub__(self, other):
        """Subtract other"""
        return self._handle_binary_comparison(
            other,
            primitives.SubtractNumeric,
            primitives.SubtractNumericScalar,
        )

    def __rsub__(self, other):
        return Feature([self], primitive=primitives.ScalarSubtractNumericFeature(other))

    def __div__(self, other):
        """Divide by other"""
        return self._handle_binary_comparison(
            other,
            primitives.DivideNumeric,
            primitives.DivideNumericScalar,
        )

    def __truediv__(self, other):
        return self.__div__(other)

    def __rtruediv__(self, other):
        return self.__rdiv__(other)

    def __rdiv__(self, other):
        return Feature([self], primitive=primitives.DivideByFeature(other))

    def __mul__(self, other):
        """Multiply by other"""
        if isinstance(other, FeatureBase):
            if all(
                [
                    isinstance(f.column_schema.logical_type, (Boolean, BooleanNullable))
                    for f in (self, other)
                ],
            ):
                return Feature([self, other], primitive=primitives.MultiplyBoolean)
            if (
                "numeric" in self.column_schema.semantic_tags
                and isinstance(
                    other.column_schema.logical_type,
                    (Boolean, BooleanNullable),
                )
                or "numeric" in other.column_schema.semantic_tags
                and isinstance(
                    self.column_schema.logical_type,
                    (Boolean, BooleanNullable),
                )
            ):
                return Feature(
                    [self, other],
                    primitive=primitives.MultiplyNumericBoolean,
                )
        return self._handle_binary_comparison(
            other,
            primitives.MultiplyNumeric,
            primitives.MultiplyNumericScalar,
        )

    def __rmul__(self, other):
        return self.__mul__(other)

    def __mod__(self, other):
        """Take modulus of this feature by other"""
        return self._handle_binary_comparison(
            other,
            primitives.ModuloNumeric,
            primitives.ModuloNumericScalar,
        )

    def __rmod__(self, other):
        return Feature([self], primitive=primitives.ModuloByFeature(other))

    def __and__(self, other):
        return self.AND(other)

    def __rand__(self, other):
        return Feature([other, self], primitive=primitives.And)

    def __or__(self, other):
        return self.OR(other)

    def __ror__(self, other):
        return Feature([other, self], primitive=primitives.Or)

    def __not__(self):
        return self.NOT()

    def __abs__(self):
        return Feature([self], primitive=primitives.Absolute)

    def __neg__(self):
        return Feature([self], primitive=primitives.Negate)

    def AND(self, other_feature):
        """Logical AND with other_feature"""
        return Feature([self, other_feature], primitive=primitives.And)

    def OR(self, other_feature):
        """Logical OR with other_feature"""
        return Feature([self, other_feature], primitive=primitives.Or)

    def NOT(self):
        """Creates inverse of feature"""
        return Feature([self], primitive=primitives.Not)

    def isin(self, list_of_output):
        return Feature(
            [self],
            primitive=primitives.IsIn(list_of_outputs=list_of_output),
        )

    def is_null(self):
        """Compares feature to null by equality"""
        return Feature([self], primitive=primitives.IsNull)

    def __invert__(self):
        return self.NOT()

    def unique_name(self):
        return "%s: %s" % (self.dataframe_name, self.get_name())

    def relationship_path_name(self):
        return self.relationship_path.name
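

# Illustrative sketch only; nothing in this module uses it. It mirrors the
# dispatch in FeatureBase._handle_binary_comparison above with stand-in
# classes: an operand that is itself a feature selects the two-input
# primitive, while any other operand is treated as a scalar and stored on the
# scalar primitive. ``_sketch_binary_dispatch`` and the classes defined inside
# it are hypothetical names, not part of the featuretools API.
def _sketch_binary_dispatch():
    class ScalarPrimitive:
        def __init__(self, value):
            # the scalar operand is kept on the primitive instance
            self.value = value

    class SketchFeature:
        def __init__(self, inputs=(), primitive=None):
            self.inputs = list(inputs)
            self.primitive = primitive

        def __gt__(self, other):
            if isinstance(other, SketchFeature):
                # feature > feature: both operands become inputs
                return SketchFeature([self, other], primitive="GreaterThan")
            # feature > scalar: one input, the scalar lives on the primitive
            return SketchFeature([self], primitive=ScalarPrimitive(other))

    amount, limit = SketchFeature(), SketchFeature()
    by_feature = amount > limit
    by_scalar = amount > 10
    return len(by_feature.inputs), len(by_scalar.inputs), by_scalar.primitive.value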


class IdentityFeature(FeatureBase):
    """Feature for a dataframe that is equivalent to an underlying column"""

    def __init__(self, column, name=None):
        self.column_name = column.ww.name
        self.return_type = column.ww.schema

        metadata = column.ww.schema._metadata
        es = _ES_REF[metadata["entityset_id"]]
        super(IdentityFeature, self).__init__(
            dataframe=es[metadata["dataframe_name"]],
            base_features=[],
            relationship_path=RelationshipPath([]),
            primitive=PrimitiveBase,
            name=name,
        )

    @classmethod
    def from_dictionary(cls, arguments, entityset, dependencies, primitive):
        dataframe_name = arguments["dataframe_name"]
        column_name = arguments["column_name"]
        column = entityset[dataframe_name].ww[column_name]
        return cls(column=column, name=arguments["name"])

    def copy(self):
        """Return copy of feature"""
        return IdentityFeature(self.entityset[self.dataframe_name].ww[self.column_name])

    def generate_name(self):
        return self.column_name

    def get_depth(self, stop_at=None):
        return 0

    def get_arguments(self):
        return {
            "name": self.get_name(),
            "column_name": self.column_name,
            "dataframe_name": self.dataframe_name,
        }

    @property
    def column_schema(self):
        return self.return_type
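

# Illustrative sketch only; nothing in this module uses it. It shows the depth
# rule on a toy dependency map: a feature with no dependencies (like an
# IdentityFeature) has depth 0, and any other feature is one level deeper than
# its deepest dependency. ``_sketch_depth`` and the feature names in its
# docstring are hypothetical, and the ``stop_at`` handling of
# FeatureBase.get_depth is left out.
def _sketch_depth(name, dependencies):
    """Return the depth of ``name`` given a dict mapping each feature name to
    the names it is computed from, e.g.
    ``{"amount": [], "SUM(amount)": ["amount"]}``."""
    direct = dependencies[name]
    if not direct:
        return 0
    return 1 + max(_sketch_depth(dep, dependencies) for dep in direct)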


class DirectFeature(FeatureBase):
    """Feature for child dataframe that inherits
    a feature value from a parent dataframe"""

    input_types = [ColumnSchema()]
    return_type = None

    def __init__(
        self,
        base_feature,
        child_dataframe_name,
        relationship=None,
        name=None,
    ):
        base_feature = _validate_base_features(base_feature)[0]
        self.parent_dataframe_name = base_feature.dataframe_name
        relationship = self._handle_relationship(
            base_feature.entityset,
            child_dataframe_name,
            relationship,
        )
        child_dataframe = base_feature.entityset[child_dataframe_name]
        super(DirectFeature, self).__init__(
            dataframe=child_dataframe,
            base_features=[base_feature],
            relationship_path=RelationshipPath([(True, relationship)]),
            primitive=PrimitiveBase,
            name=name,
        )

    def _handle_relationship(self, entityset, child_dataframe_name, relationship):
        child_dataframe = entityset[child_dataframe_name]
        if relationship:
            relationship_child = relationship.child_dataframe
            assert (
                child_dataframe.ww.name == relationship_child.ww.name
            ), "child_dataframe must be the relationship child dataframe"

            assert (
                self.parent_dataframe_name == relationship.parent_dataframe.ww.name
            ), "Base feature must be defined on the relationship parent dataframe"
        else:
            child_relationships = entityset.get_forward_relationships(
                child_dataframe.ww.name,
            )
            possible_relationships = (
                r
                for r in child_relationships
                if r.parent_dataframe.ww.name == self.parent_dataframe_name
            )
            relationship = next(possible_relationships, None)

            if not relationship:
                raise RuntimeError(
                    'No relationship from "%s" to "%s" found.'
                    % (child_dataframe.ww.name, self.parent_dataframe_name),
                )

            # Check for another path.
            elif next(possible_relationships, None):
                message = (
                    "There are multiple relationships to the base dataframe. "
                    "You must specify a relationship."
                )
                raise RuntimeError(message)

        return relationship

    @classmethod
    def from_dictionary(cls, arguments, entityset, dependencies, primitive):
        base_feature = dependencies[arguments["base_feature"]]
        relationship = Relationship.from_dictionary(
            arguments["relationship"],
            entityset,
        )
        child_dataframe_name = relationship.child_dataframe.ww.name
        return cls(
            base_feature=base_feature,
            child_dataframe_name=child_dataframe_name,
            relationship=relationship,
            name=arguments["name"],
        )

    @property
    def number_output_features(self):
        return self.base_features[0].number_output_features

    @property
    def default_value(self):
        return self.base_features[0].default_value

    def copy(self):
        """Return copy of feature"""
        _is_forward, relationship = self.relationship_path[0]
        return DirectFeature(
            self.base_features[0],
            self.dataframe_name,
            relationship=relationship,
        )

    @property
    def column_schema(self):
        return self.base_features[0].column_schema

    def generate_name(self):
        return self._name_from_base(self.base_features[0].get_name())

    def generate_names(self):
        return [
            self._name_from_base(base_name)
            for base_name in self.base_features[0].get_feature_names()
        ]

    def get_arguments(self):
        _is_forward, relationship = self.relationship_path[0]
        return {
            "name": self.get_name(),
            "base_feature": self.base_features[0].unique_name(),
            "relationship": relationship.to_dictionary(),
        }

    def _name_from_base(self, base_name):
        return "%s.%s" % (self.relationship_path_name(), base_name)
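

# Illustrative sketch only; nothing in this module uses it. It isolates the
# "exactly one match" check that DirectFeature._handle_relationship above
# performs with a generator: the first ``next`` call finds a match, and a
# second call detects ambiguity without building a list.
# ``_sketch_pick_only`` and its error messages are hypothetical.
def _sketch_pick_only(candidates, matches):
    possible = (c for c in candidates if matches(c))
    first = next(possible, None)
    if first is None:
        raise RuntimeError("No match found.")
    if next(possible, None) is not None:
        raise RuntimeError("There are multiple matches. You must specify one.")
    return first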


class AggregationFeature(FeatureBase):
    #: (:class:`.FeatureBase`): Feature to condition this feature by in
    #: computation (e.g., take the Count of products where the product_id is
    #: "basketball").
    where = None
    #: (str or :class:`.Timedelta`): Use only some amount of previous data from
    #: each time point during calculation
    use_previous = None

    def __init__(
        self,
        base_features,
        parent_dataframe_name,
        primitive,
        relationship_path=None,
        use_previous=None,
        where=None,
        name=None,
    ):
        base_features = _validate_base_features(base_features)

        for bf in base_features:
            if bf.number_output_features > 1:
                raise ValueError("Cannot stack on whole multi-output feature.")

        self.child_dataframe_name = base_features[0].dataframe_name
        entityset = base_features[0].entityset
        relationship_path, self._path_is_unique = self._handle_relationship_path(
            entityset,
            parent_dataframe_name,
            relationship_path,
        )

        self.parent_dataframe_name = parent_dataframe_name

        if where is not None:
            self.where = _validate_base_features(where)[0]
            msg = "Where feature must be defined on child dataframe {}".format(
                self.child_dataframe_name,
            )
            assert self.where.dataframe_name == self.child_dataframe_name, msg

        if use_previous:
            assert entityset[self.child_dataframe_name].ww.time_index is not None, (
                "Applying function that requires time index to dataframe that "
                "doesn't have one"
            )
            self.use_previous = _check_timedelta(use_previous)
            assert len(base_features) > 0
            time_index = base_features[0].dataframe.ww.time_index
            # check the time index before using it to look up the time column
            assert (
                time_index is not None
            ), "Use previous can only be defined on dataframes with a time index"
            time_col = base_features[0].dataframe.ww[time_index]
            assert _check_time_against_column(self.use_previous, time_col)

        super(AggregationFeature, self).__init__(
            dataframe=entityset[parent_dataframe_name],
            base_features=base_features,
            relationship_path=relationship_path,
            primitive=primitive,
            name=name,
        )

    def _handle_relationship_path(
        self,
        entityset,
        parent_dataframe_name,
        relationship_path,
    ):
        parent_dataframe = entityset[parent_dataframe_name]
        child_dataframe = entityset[self.child_dataframe_name]

        if relationship_path:
            assert all(
                not is_forward for is_forward, _r in relationship_path
            ), "All relationships in path must be backward"

            _is_forward, first_relationship = relationship_path[0]
            first_parent = first_relationship.parent_dataframe
            assert (
                parent_dataframe.ww.name == first_parent.ww.name
            ), "parent_dataframe must match first relationship in path."

            _is_forward, last_relationship = relationship_path[-1]
            assert (
                child_dataframe.ww.name == last_relationship.child_dataframe.ww.name
            ), "Base feature must be defined on the dataframe at the end of relationship_path"

            path_is_unique = entityset.has_unique_forward_path(
                child_dataframe.ww.name,
                parent_dataframe.ww.name,
            )
        else:
            paths = entityset.find_backward_paths(
                parent_dataframe.ww.name,
                child_dataframe.ww.name,
            )
            first_path = next(paths, None)

            if not first_path:
                raise RuntimeError(
                    'No backward path from "%s" to "%s" found.'
                    % (parent_dataframe.ww.name, child_dataframe.ww.name),
                )
            # Check for another path.
            elif next(paths, None):
                message = (
                    "There are multiple possible paths to the base dataframe. "
                    "You must specify a relationship path."
                )
                raise RuntimeError(message)

            relationship_path = RelationshipPath([(False, r) for r in first_path])
            path_is_unique = True

        return relationship_path, path_is_unique

    @classmethod
    def from_dictionary(cls, arguments, entityset, dependencies, primitive):
        base_features = [dependencies[name] for name in arguments["base_features"]]
        relationship_path = [
            Relationship.from_dictionary(r, entityset)
            for r in arguments["relationship_path"]
        ]
        parent_dataframe_name = relationship_path[0].parent_dataframe.ww.name
        relationship_path = RelationshipPath([(False, r) for r in relationship_path])

        use_previous_data = arguments["use_previous"]
        use_previous = use_previous_data and Timedelta.from_dictionary(
            use_previous_data,
        )

        where_name = arguments["where"]
        where = where_name and dependencies[where_name]

        feat = cls(
            base_features=base_features,
            parent_dataframe_name=parent_dataframe_name,
            primitive=primitive,
            relationship_path=relationship_path,
            use_previous=use_previous,
            where=where,
            name=arguments["name"],
        )
        feat._names = arguments.get("feature_names")
        return feat

    def copy(self):
        return AggregationFeature(
            self.base_features,
            parent_dataframe_name=self.parent_dataframe_name,
            relationship_path=self.relationship_path,
            primitive=self.primitive,
            use_previous=self.use_previous,
            where=self.where,
        )

    def _where_str(self):
        if self.where is not None:
            where_str = " WHERE " + self.where.get_name()
        else:
            where_str = ""
        return where_str

    def _use_prev_str(self):
        if self.use_previous is not None and hasattr(self.use_previous, "get_name"):
            use_prev_str = ", Last {}".format(self.use_previous.get_name())
        else:
            use_prev_str = ""
        return use_prev_str

    def generate_name(self):
        return self.primitive.generate_name(
            base_feature_names=[bf.get_name() for bf in self.base_features],
            relationship_path_name=self.relationship_path_name(),
            parent_dataframe_name=self.parent_dataframe_name,
            where_str=self._where_str(),
            use_prev_str=self._use_prev_str(),
        )

    def generate_names(self):
        return self.primitive.generate_names(
            base_feature_names=[bf.get_name() for bf in self.base_features],
            relationship_path_name=self.relationship_path_name(),
            parent_dataframe_name=self.parent_dataframe_name,
            where_str=self._where_str(),
            use_prev_str=self._use_prev_str(),
        )

    def get_arguments(self):
        arg_dict = {
            "name": self.get_name(),
            "base_features": [feat.unique_name() for feat in self.base_features],
            "relationship_path": [r.to_dictionary() for _, r in self.relationship_path],
            "primitive": self.primitive,
            "where": self.where and self.where.unique_name(),
            "use_previous": self.use_previous and self.use_previous.get_arguments(),
        }
        if self.number_output_features > 1:
            arg_dict["feature_names"] = self.get_feature_names()
        return arg_dict

    def relationship_path_name(self):
        if self._path_is_unique:
            return self.child_dataframe_name
        else:
            return self.relationship_path.name


class TransformFeature(FeatureBase):
    def __init__(self, base_features, primitive, name=None):
        base_features = _validate_base_features(base_features)

        for bf in base_features:
            if bf.number_output_features > 1:
                raise ValueError("Cannot stack on whole multi-output feature.")
        dataframe = base_features[0].entityset[base_features[0].dataframe_name]
        super(TransformFeature, self).__init__(
            dataframe=dataframe,
            base_features=base_features,
            relationship_path=RelationshipPath([]),
            primitive=primitive,
            name=name,
        )

    @classmethod
    def from_dictionary(cls, arguments, entityset, dependencies, primitive):
        base_features = [dependencies[name] for name in arguments["base_features"]]
        feat = cls(
            base_features=base_features,
            primitive=primitive,
            name=arguments["name"],
        )
        feat._names = arguments.get("feature_names")
        return feat

    def copy(self):
        return TransformFeature(self.base_features, self.primitive)

    def generate_name(self):
        return self.primitive.generate_name(
            base_feature_names=[bf.get_name() for bf in self.base_features],
        )

    def generate_names(self):
        return self.primitive.generate_names(
            base_feature_names=[bf.get_name() for bf in self.base_features],
        )

    def get_arguments(self):
        arg_dict = {
            "name": self.get_name(),
            "base_features": [feat.unique_name() for feat in self.base_features],
            "primitive": self.primitive,
        }
        if self.number_output_features > 1:
            arg_dict["feature_names"] = self.get_feature_names()
        return arg_dict


class GroupByTransformFeature(TransformFeature):
    def __init__(self, base_features, primitive, groupby, name=None):
        if not isinstance(groupby, FeatureBase):
            groupby = IdentityFeature(groupby)
        assert (
            len({"category", "foreign_key"} - groupby.column_schema.semantic_tags) < 2
        )
        self.groupby = groupby

        base_features = _validate_base_features(base_features)
        base_features.append(groupby)

        super(GroupByTransformFeature, self).__init__(
            base_features=base_features,
            primitive=primitive,
            name=name,
        )

    @classmethod
    def from_dictionary(cls, arguments, entityset, dependencies, primitive):
        base_features = [dependencies[name] for name in arguments["base_features"]]
        groupby = dependencies[arguments["groupby"]]
        feat = cls(
            base_features=base_features,
            primitive=primitive,
            groupby=groupby,
            name=arguments["name"],
        )
        feat._names = arguments.get("feature_names")
        return feat

    def copy(self):
        # the groupby feature is appended to base_features in the __init__
        # so here we separate them again
        return GroupByTransformFeature(
            self.base_features[:-1],
            self.primitive,
            self.groupby,
        )

    def generate_name(self):
        # exclude the groupby feature from base_names since it has a special
        # place in the feature name
        base_names = [bf.get_name() for bf in self.base_features[:-1]]
        _name = self.primitive.generate_name(base_names)
        return "{} by {}".format(_name, self.groupby.get_name())

    def generate_names(self):
        base_names = [bf.get_name() for bf in self.base_features[:-1]]
        _names = self.primitive.generate_names(base_names)
        names = [name + " by {}".format(self.groupby.get_name()) for name in _names]
        return names

    def get_arguments(self):
        # Do not include groupby in base_features.
        feature_names = [
            feat.unique_name()
            for feat in self.base_features
            if feat.unique_name() != self.groupby.unique_name()
        ]
        arg_dict = {
            "name": self.get_name(),
            "base_features": feature_names,
            "primitive": self.primitive,
            "groupby": self.groupby.unique_name(),
        }
        if self.number_output_features > 1:
            arg_dict["feature_names"] = self.get_feature_names()
        return arg_dict


class Feature(object):
    """
    Alias to create feature. Infers the feature type based on init parameters.
    """

    def __new__(
        cls,
        base,
        dataframe_name=None,
        groupby=None,
        parent_dataframe_name=None,
        primitive=None,
        use_previous=None,
        where=None,
    ):
        # either direct or identity
        if primitive is None and dataframe_name is None:
            return IdentityFeature(base)
        elif primitive is None and dataframe_name is not None:
            return DirectFeature(base, dataframe_name)
        elif primitive is not None and parent_dataframe_name is not None:
            assert isinstance(primitive, AggregationPrimitive) or issubclass(
                primitive,
                AggregationPrimitive,
            )
            return AggregationFeature(
                base,
                parent_dataframe_name=parent_dataframe_name,
                use_previous=use_previous,
                where=where,
                primitive=primitive,
            )
        elif primitive is not None:
            assert isinstance(primitive, TransformPrimitive) or issubclass(
                primitive,
                TransformPrimitive,
            )
            if groupby is not None:
                return GroupByTransformFeature(
                    base,
                    primitive=primitive,
                    groupby=groupby,
                )
            return TransformFeature(base, primitive=primitive)

        raise Exception("Unrecognized feature initialization")
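

# Illustrative sketch only; nothing in this module uses it. It restates, as a
# plain function, which feature class ``Feature.__new__`` above picks for each
# combination of arguments. ``_sketch_feature_kind`` and its boolean
# parameters are hypothetical names, and the primitive type assertions made by
# ``Feature`` are left out.
def _sketch_feature_kind(
    has_primitive=False,
    has_dataframe_name=False,
    has_parent_dataframe_name=False,
    has_groupby=False,
):
    if not has_primitive:
        # no primitive: a raw column, or a value inherited from a parent dataframe
        return "DirectFeature" if has_dataframe_name else "IdentityFeature"
    if has_parent_dataframe_name:
        # primitive applied across a relationship to a parent dataframe
        return "AggregationFeature"
    # primitive applied within a single dataframe
    return "GroupByTransformFeature" if has_groupby else "TransformFeature"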


class FeatureOutputSlice(FeatureBase):
    """
    Class to access a single output column of a multi-output feature
    """

    def __init__(self, base_feature, n, name=None):
        base_features = [base_feature]
        self.num_output_parent = base_feature.number_output_features

        msg = "cannot access slice from single output feature"
        assert self.num_output_parent > 1, msg
        msg = "cannot access column that is not between 0 and " + str(
            self.num_output_parent - 1,
        )
        assert n < self.num_output_parent, msg

        self.n = n
        self._name = name
        self._names = [name] if name else None
        self.base_features = base_features
        self.base_feature = base_features[0]

        self.dataframe_name = base_feature.dataframe_name
        self.entityset = base_feature.entityset
        self.primitive = base_feature.primitive

        self.relationship_path = base_feature.relationship_path

    def __getitem__(self, key):
        raise ValueError("Cannot get item from slice of multi output feature")

    def generate_name(self):
        return self.base_feature.get_feature_names()[self.n]

    @property
    def number_output_features(self):
        return 1

    def get_arguments(self):
        return {
            "name": self.get_name(),
            "base_feature": self.base_feature.unique_name(),
            "n": self.n,
        }

    @classmethod
    def from_dictionary(cls, arguments, entityset, dependencies, primitive):
        base_feature_name = arguments["base_feature"]
        base_feature = dependencies[base_feature_name]
        n = arguments["n"]
        name = arguments["name"]
        return cls(base_feature=base_feature, n=n, name=name)

    def copy(self):
        return FeatureOutputSlice(self.base_feature, self.n)


def _validate_base_features(feature):
    if "Series" == type(feature).__name__:
        return [IdentityFeature(feature)]
    elif hasattr(feature, "__iter__"):
        features = [_validate_base_features(f)[0] for f in feature]
        msg = "all base features must share the same dataframe"
        assert len({bf.dataframe_name for bf in features}) == 1, msg
        return features
    elif isinstance(feature, FeatureBase):
        return [feature]
    else:
        raise Exception("Not a feature")


================================================
FILE: featuretools/feature_base/feature_descriptions.py
================================================
import json

import featuretools as ft


def describe_feature(
    feature,
    feature_descriptions=None,
    primitive_templates=None,
    metadata_file=None,
):
    """Generates an English language description of a feature.

    Args:
        feature (FeatureBase) : Feature to describe
        feature_descriptions (dict, optional) : dictionary mapping features or unique
            feature names to custom descriptions
        primitive_templates (dict, optional) : dictionary mapping primitives or
            primitive names to description templates
        metadata_file (str, optional) : path to json metadata file

    Returns:
        str : English description of the feature
    """
    feature_descriptions = feature_descriptions or {}
    primitive_templates = primitive_templates or {}

    if metadata_file:
        file_feature_descriptions, file_primitive_templates = parse_json_metadata(
            metadata_file,
        )
        feature_descriptions = {**file_feature_descriptions, **feature_descriptions}
        primitive_templates = {**file_primitive_templates, **primitive_templates}

    description = generate_description(
        feature,
        feature_descriptions,
        primitive_templates,
    )
    return description[:1].upper() + description[1:] + "."


def generate_description(feature, feature_descriptions, primitive_templates):
    # Check if feature has custom description
    if feature in feature_descriptions or feature.unique_name() in feature_descriptions:
        description = feature_descriptions.get(feature) or feature_descriptions.get(
            feature.unique_name(),
        )
        return description

    # Check if identity feature:
    if isinstance(feature, ft.IdentityFeature):
        description = feature.column_schema.description
        if description is None:
            description = 'the "{}"'.format(feature.column_name)
        return description

    # Handle direct features
    if isinstance(feature, ft.DirectFeature):
        base_feature, direct_description = get_direct_description(feature)
        direct_base = generate_description(
            base_feature,
            feature_descriptions,
            primitive_templates,
        )
        return direct_base + direct_description

    # Get input descriptions
    input_descriptions = []
    input_columns = feature.base_features
    if isinstance(feature, ft.feature_base.FeatureOutputSlice):
        input_columns = feature.base_feature.base_features

    for input_col in input_columns:
        col_description = generate_description(
            input_col,
            feature_descriptions,
            primitive_templates,
        )
        input_descriptions.append(col_description)

    # Remove groupby description from input columns
    groupby_description = None
    if isinstance(feature, ft.GroupByTransformFeature):
        groupby_description = input_descriptions.pop()

    # Generate primitive description
    template_override = None
    if (
        feature.primitive in primitive_templates
        or feature.primitive.name in primitive_templates
    ):
        template_override = primitive_templates.get(
            feature.primitive,
        ) or primitive_templates.get(feature.primitive.name)
    slice_num = feature.n if hasattr(feature, "n") else None
    primitive_description = feature.primitive.get_description(
        input_descriptions,
        slice_num=slice_num,
        template_override=template_override,
    )
    if isinstance(feature, ft.feature_base.FeatureOutputSlice):
        feature = feature.base_feature

    # Generate groupby phrase if applicable
    groupby = ""
    if isinstance(feature, ft.AggregationFeature):
        groupby_description = get_aggregation_groupby(feature, feature_descriptions)
    if groupby_description is not None:
        if groupby_description.startswith("the "):
            groupby_description = groupby_description[4:]
        groupby = "for each {}".format(groupby_description)

    # Generate aggregation dataframe phrase with use_previous
    dataframe_description = ""
    if isinstance(feature, ft.AggregationFeature):
        if feature.use_previous:
            dataframe_description = "of the previous {} of ".format(
                feature.use_previous.get_name().lower(),
            )
        else:
            dataframe_description = "of all instances of "
        dataframe_description += '"{}"'.format(
            feature.relationship_path[-1][1].child_dataframe.ww.name,
        )

    # Generate where phrase
    where = ""
    if hasattr(feature, "where") and feature.where:
        where_col = generate_description(
            feature.where.base_features[0],
            feature_descriptions,
            primitive_templates,
        )
        where = "where {} is {}".format(where_col, feature.where.primitive.value)

    # Join all parts of template
    description_template = [
        primitive_description,
        dataframe_description,
        where,
        groupby,
    ]
    description = " ".join([phrase for phrase in description_template if phrase != ""])

    return description
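

# Illustrative sketch (not part of the library): how the final phrase assembly
# above behaves. The helper names join_phrases, groupby_phrase and finalize are
# hypothetical, and the input phrases are made-up examples. It mirrors three
# steps from the source: dropping a leading "the " from the groupby description,
# joining only the non-empty phrases, and the capitalization plus trailing
# period applied by describe_feature.

```python
def groupby_phrase(groupby_description):
    # Drop a leading "the " so the phrase reads "for each X".
    if groupby_description.startswith("the "):
        groupby_description = groupby_description[4:]
    return "for each {}".format(groupby_description)


def join_phrases(primitive_description, dataframe_description="", where="", groupby=""):
    # Empty phrases are skipped so no double spaces appear.
    parts = [primitive_description, dataframe_description, where, groupby]
    return " ".join(part for part in parts if part != "")


def finalize(description):
    # Capitalize only the first character and close the sentence.
    return description[:1].upper() + description[1:] + "."


sentence = finalize(
    join_phrases(
        'the sum of the "amount"',
        dataframe_description='of all instances of "transactions"',
        groupby=groupby_phrase('the "customer_id"'),
    )
)
print(sentence)
```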


def get_direct_description(feature):
    direct_description = (
        ' the instance of "{}" associated with this ' 'instance of "{}"'.format(
            feature.relationship_path[-1][1].parent_dataframe.ww.name,
            feature.dataframe_name,
        )
    )
    base_features = feature.base_features
    # shortens stacked direct features to make it easier to understand
    while isinstance(base_features[0], ft.DirectFeature):
        base_feat = base_features[0]
        base_feat_description = ' the instance of "{}" associated ' "with".format(
            base_feat.relationship_path[-1][1].parent_dataframe.ww.name,
        )
        direct_description = base_feat_description + direct_description
        base_features = base_feat.base_features
    direct_description = " for" + direct_description

    return base_features[0], direct_description


def get_aggregation_groupby(feature, feature_descriptions=None):
    if feature_descriptions is None:
        feature_descriptions = {}
    groupby_name = feature.dataframe.ww.index
    groupby = ft.IdentityFeature(
        feature.entityset[feature.dataframe_name].ww[groupby_name],
    )
    if groupby in feature_descriptions or groupby.unique_name() in feature_descriptions:
        return feature_descriptions.get(groupby) or feature_descriptions.get(
            groupby.unique_name(),
        )
    else:
        return '"{}" in "{}"'.format(groupby_name, feature.dataframe_name)


def parse_json_metadata(file):
    with open(file) as f:
        json_metadata = json.load(f)

    return (
        json_metadata.get("feature_descriptions", {}),
        json_metadata.get("primitive_templates", {}),
    )
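

# Illustrative sketch (not part of the library): the JSON layout that
# parse_json_metadata reads, and the precedence rule describe_feature applies
# when merging it with explicit arguments. The function name parse_metadata,
# the feature name and the template strings are hypothetical. Only the two
# top-level keys come from the source.

```python
import json
import os
import tempfile


def parse_metadata(path):
    # Mirrors parse_json_metadata: missing keys fall back to empty dicts.
    with open(path) as f:
        metadata = json.load(f)
    return (
        metadata.get("feature_descriptions", {}),
        metadata.get("primitive_templates", {}),
    )


# Hypothetical file content with the two recognized top-level keys.
metadata = {
    "feature_descriptions": {"customers: age": "the age of the customer"},
    "primitive_templates": {"sum": "the total of {}"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metadata.json")
    with open(path, "w") as f:
        json.dump(metadata, f)
    file_descriptions, file_templates = parse_metadata(path)

# Explicit arguments are unpacked last, so they override entries from the file.
explicit_templates = {"sum": "the sum of {}"}
merged_templates = {**file_templates, **explicit_templates}
print(merged_templates)
```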


================================================
FILE: featuretools/feature_base/feature_visualizer.py
================================================
import html

from featuretools.feature_base.feature_base import (
    AggregationFeature,
    DirectFeature,
    FeatureOutputSlice,
    IdentityFeature,
    TransformFeature,
)
from featuretools.feature_base.feature_descriptions import describe_feature
from featuretools.utils.plot_utils import (
    check_graphviz,
    get_graphviz_format,
    save_graph,
)

TARGET_COLOR = "#D9EAD3"
TABLE_TEMPLATE = """<
<TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="10">
    <TR>
        <TD colspan="1" bgcolor="#A9A9A9"><B>{dataframe_name}</B></TD>
    </TR>{table_cols}
</TABLE>>"""
COL_TEMPLATE = """<TR><TD ALIGN="LEFT" port="{}">{}</TD></TR>"""
TARGET_TEMPLATE = """
    <TR>
        <TD ALIGN="LEFT" port="{}" BGCOLOR="{target_color}">{}</TD>
    </TR>""".format(
    "{}",
    "{}",
    target_color=TARGET_COLOR,
)


def graph_feature(feature, to_file=None, description=False, **kwargs):
    """Generates a feature lineage graph for the given feature

    Args:
        feature (FeatureBase) : Feature to generate lineage graph for
        to_file (str, optional) : Path to where the plot should be saved.
            If set to None (as by default), the plot will not be saved.
        description (bool or str, optional): The feature description to use as a caption
            for the graph. If False, no description is added. Set to True
            to use an auto-generated description. Defaults to False.
        kwargs (keywords): Additional keyword arguments to pass as keyword arguments
            to the ft.describe_feature function.

    Returns:
        graphviz.Digraph : Graph object that can directly be displayed in Jupyter notebooks.
    """
    graphviz = check_graphviz()
    format_ = get_graphviz_format(graphviz=graphviz, to_file=to_file)

    # Initialize a new directed graph
    graph = graphviz.Digraph(
        feature.get_name(),
        format=format_,
        graph_attr={"rankdir": "LR"},
    )

    dataframes = {}
    edges = ([], [])
    primitives = []
    groupbys = []

    _, max_depth = get_feature_data(
        feature,
        dataframes,
        groupbys,
        edges,
        primitives,
        layer=0,
    )
    dataframes[feature.dataframe_name]["targets"].add(feature.get_name())

    for df_name in dataframes:
        dataframe_name = (
            "\u2605 {} (target)".format(df_name)
            if df_name == feature.dataframe_name
            else df_name
        )
        dataframe_table = get_dataframe_table(dataframe_name, dataframes[df_name])
        graph.attr("node", shape="plaintext")
        graph.node(df_name, dataframe_table)

    graph.attr("node", shape="diamond")
    num_primitives = len(primitives)
    for prim_name, prim_label, layer, prim_type in primitives:
        step_num = max_depth - layer
        if num_primitives == 1:
            type_str = (
                '<FONT POINT-SIZE="12"><B>{}</B><BR></BR></FONT>'.format(prim_type)
                if prim_type
                else ""
            )
            prim_label = "<{}{}>".format(type_str, prim_label)
        else:
            step = "Step {}".format(step_num)
            type_str = "   " + prim_type if prim_type else ""
            prim_label = (
                '<<FONT POINT-SIZE="12"><B>{}:</B>{}<BR></BR></FONT>{}>'.format(
                    step,
                    type_str,
                    prim_label,
                )
            )

        # sink first layer transform primitive if multiple primitives
        if step_num == 1 and prim_type == "Transform" and num_primitives > 1:
            with graph.subgraph() as init_transform:
                init_transform.attr(rank="min")
                init_transform.node(name=prim_name, label=prim_label)
        else:
            graph.node(name=prim_name, label=prim_label)

    graph.attr("node", shape="box")
    for groupby_name, groupby_label in groupbys:
        graph.node(name=groupby_name, label=groupby_label)

    graph.attr("edge", style="solid", dir="forward")
    for edge in edges[1]:
        graph.edge(*edge)

    graph.attr("edge", style="dotted", arrowhead="none", dir="forward")
    for edge in edges[0]:
        graph.edge(*edge)

    if description is True:
        graph.attr(label=describe_feature(feature, **kwargs))
    elif description is not False:
        graph.attr(label=description)

    if to_file:
        save_graph(graph, to_file, format_)

    return graph


def get_feature_data(feat, dataframes, groupbys, edges, primitives, layer=0):
    # 1) add feature to dataframes tables:
    feat_name = feat.get_name()
    if feat.dataframe_name not in dataframes:
        add_dataframe(feat.dataframe, dataframes)
    dataframe_dict = dataframes[feat.dataframe_name]

    # if we've already explored this feat, continue
    feat_node = "{}:{}".format(feat.dataframe_name, feat_name)
    if feat_name in dataframe_dict["columns"] or feat_name in dataframe_dict["feats"]:
        return feat_node, layer

    if isinstance(feat, IdentityFeature):
        dataframe_dict["columns"].add(feat_name)
    else:
        dataframe_dict["feats"].add(feat_name)
    base_node = feat_node

    # 2) if multi-output, convert feature to generic base
    if isinstance(feat, FeatureOutputSlice):
        feat = feat.base_feature
        feat_name = feat.get_name()

    # 3) add primitive node
    if feat.primitive.name or isinstance(feat, DirectFeature):
        prim_name = feat.primitive.name if feat.primitive.name else "join"
        prim_type = ""
        if isinstance(feat, AggregationFeature):
            prim_type = "Aggregation"
        elif isinstance(feat, TransformFeature):
            prim_type = "Transform"
        primitive_node = "{}_{}_{}".format(layer, feat_name, prim_name)
        primitives.append((primitive_node, prim_name.upper(), layer, prim_type))

        edges[1].append([primitive_node, base_node])
        base_node = primitive_node

    # 4) add groupby/join edges and nodes
    dependencies = [(dep.hash(), dep) for dep in feat.get_dependencies()]
    for is_forward, r in feat.relationship_path:
        if is_forward:
            if r.child_dataframe.ww.name not in dataframes:
                add_dataframe(r.child_dataframe, dataframes)
            dataframes[r.child_dataframe.ww.name]["columns"].add(r._child_column_name)
            child_node = "{}:{}".format(r.child_dataframe.ww.name, r._child_column_name)
            edges[0].append([base_node, child_node])
        else:
            if r.child_dataframe.ww.name not in dataframes:
                add_dataframe(r.child_dataframe, dataframes)
            dataframes[r.child_dataframe.ww.name]["columns"].add(r._child_column_name)
            child_node = "{}:{}".format(r.child_dataframe.ww.name, r._child_column_name)
            child_name = child_node.replace(":", "--")
            groupby_node = "{}_groupby_{}".format(feat_name, child_name)
            groupby_name = "group by\n{}".format(r._child_column_name)
            groupbys.append((groupby_node, groupby_name))
            edges[0].append([child_node, groupby_node])
            edges[1].append([groupby_node, base_node])
            base_node = groupby_node

    if hasattr(feat, "groupby"):
        groupby = feat.groupby
        _ = get_feature_data(
            groupby,
            dataframes,
            groupbys,
            edges,
            primitives,
            layer + 1,
        )
        dependencies.remove((groupby.hash(), groupby))

        groupby_name = groupby.get_name()
        if isinstance(groupby, IdentityFeature):
            dataframes[groupby.dataframe_name]["columns"].add(groupby_name)
        else:
            dataframes[groupby.dataframe_name]["feats"].add(groupby_name)

        child_node = "{}:{}".format(groupby.dataframe_name, groupby_name)
        child_name = child_node.replace(":", "--")
        groupby_node = "{}_groupby_{}".format(feat_name, child_name)
        groupby_name = "group by\n{}".format(groupby_name)
        groupbys.append((groupby_node, groupby_name))
        edges[0].append([child_node, groupby_node])
        edges[1].append([groupby_node, base_node])
        base_node = groupby_node

    # 5) recurse over dependencies
    max_depth = layer
    for _, f in dependencies:
        dependent_node, depth = get_feature_data(
            f,
            dataframes,
            groupbys,
            edges,
            primitives,
            layer + 1,
        )
        edges[1].append([dependent_node, base_node])

        max_depth = max(depth, max_depth)

    return feat_node, max_depth


def add_dataframe(dataframe, dataframe_dict):
    dataframe_dict[dataframe.ww.name] = {
        "index": dataframe.ww.index,
        "targets": set(),
        "columns": set(),
        "feats": set(),
    }


def get_dataframe_table(dataframe_name, dataframe_dict):
    """
    given a dict of columns and feats, construct the html table for it
    """
    index = dataframe_dict["index"]
    targets = dataframe_dict["targets"]
    columns = dataframe_dict["columns"].difference(targets)
    feats = dataframe_dict["feats"].difference(targets)

    # If the index is used, make sure it's the first element in the table
    clean_index = html.escape(index)
    if index in columns:
        rows = [COL_TEMPLATE.format(clean_index, clean_index + " (index)")]
        columns.discard(index)
    elif index in targets:
        rows = [TARGET_TEMPLATE.format(clean_index, clean_index + " (index)")]
        targets.discard(index)
    else:
        rows = []

    for col in list(columns) + list(feats) + list(targets):
        template = COL_TEMPLATE
        if col in targets:
            template = TARGET_TEMPLATE

        col = html.escape(col)
        rows.append(template.format(col, col))

    table = TABLE_TEMPLATE.format(
        dataframe_name=dataframe_name,
        table_cols="\n".join(rows),
    )
    return table
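

# Illustrative sketch (not part of the library): why get_dataframe_table
# escapes names before formatting them into the Graphviz HTML-like label.
# COL_TEMPLATE is copied from this module; the feature name is a made-up
# example. An unescaped "<" or ">" would be read as markup inside the label.

```python
import html

COL_TEMPLATE = """<TR><TD ALIGN="LEFT" port="{}">{}</TD></TR>"""

# Hypothetical feature name containing a character that is special in HTML.
name = "SUM(log.value > 5)"
escaped = html.escape(name)

# The same escaped string is used for both the port and the visible text.
row = COL_TEMPLATE.format(escaped, escaped)
print(row)
```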


================================================
FILE: featuretools/feature_base/features_deserializer.py
================================================
import json

from featuretools.entityset.deserialize import (
    description_to_entityset as deserialize_es,
)
from featuretools.feature_base.feature_base import (
    AggregationFeature,
    DirectFeature,
    Feature,
    FeatureBase,
    FeatureOutputSlice,
    GroupByTransformFeature,
    IdentityFeature,
    TransformFeature,
)
from featuretools.primitives.utils import PrimitivesDeserializer
from featuretools.utils.s3_utils import get_transport_params, use_smartopen_features
from featuretools.utils.schema_utils import check_schema_version
from featuretools.utils.wrangle import _is_s3, _is_url


def load_features(features, profile_name=None):
    """Loads the features from a filepath, S3 path, URL, an open file, or a JSON formatted string.

    Args:
        features (str or :class:`.FileObject`): The location of the saved features.
            This must be a file path, an S3 path, a URL, a JSON formatted string, or a readable file handle.

        profile_name (str, bool): The AWS profile used to read from S3. Will default to None and search for AWS credentials.
            Set to False to use an anonymous profile.

    Returns:
        features (list[:class:`.FeatureBase`]): Feature definitions list.

    Note:
        Features saved in one version of Featuretools or Python are not guaranteed to work in another.
        After upgrading Featuretools or Python, features may need to be generated again.

    Example:
        .. ipython:: python
            :suppress:

            import featuretools as ft
            import os

        .. code-block:: python

            # Option 1
            filepath = os.path.join('/Home/features/', 'list.json')
            features = ft.load_features(filepath)

            # Option 2
            filepath = os.path.join('/Home/features/', 'list.json')
            with open(filepath, 'r') as f:
                features = ft.load_features(f)

            # Option 3
            filepath = os.path.join('/Home/features/', 'list.json')
            with open(filepath, 'r') as f:
                feature_str = f.read()
            features = ft.load_features(feature_str)


    .. seealso::
        :func:`.save_features`
    """
    return FeaturesDeserializer.load(features, profile_name).to_list()
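

# Illustrative sketch (not part of the library): the input dispatch used by
# FeaturesDeserializer.load, reduced to the local cases. A string is first
# tried as JSON. If parsing fails it is treated as a file path. Anything else
# is assumed to be a readable file handle. The function name load_dict is
# hypothetical, and the URL and S3 branches are omitted here.

```python
import io
import json
import os
import tempfile


def load_dict(source):
    if isinstance(source, str):
        try:
            # json.JSONDecodeError subclasses ValueError.
            return json.loads(source)
        except ValueError:
            with open(source, "r") as f:
                return json.load(f)
    # Not a string: assume a readable file-like object.
    return json.load(source)


payload = {"feature_list": ["customers: age"]}
text = json.dumps(payload)

from_string = load_dict(text)
from_handle = load_dict(io.StringIO(text))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.json")
    with open(path, "w") as f:
        f.write(text)
    from_path = load_dict(path)
```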


class FeaturesDeserializer(object):
    FEATURE_CLASSES = {
        "AggregationFeature": AggregationFeature,
        "DirectFeature": DirectFeature,
        "Feature": Feature,
        "FeatureBase": FeatureBase,
        "GroupByTransformFeature": GroupByTransformFeature,
        "IdentityFeature": IdentityFeature,
        "TransformFeature": TransformFeature,
        "FeatureOutputSlice": FeatureOutputSlice,
    }

    def __init__(self, features_dict):
        self.features_dict = features_dict
        self._check_schema_version()
        self.entityset = deserialize_es(features_dict["entityset"])
        self._deserialized_features = {}  # name -> feature
        primitive_deserializer = PrimitivesDeserializer()
        primitive_definitions = features_dict["primitive_definitions"]
        self._deserialized_primitives = {
            k: primitive_deserializer.deserialize_primitive(v)
            for k, v in primitive_definitions.items()
        }

    @classmethod
    def load(cls, features, profile_name):
        if isinstance(features, str):
            try:
                features_dict = json.loads(features)
            except ValueError:
                if _is_url(features) or _is_s3(features):
                    transport_params = None
                    if _is_s3(features):
                        transport_params = get_transport_params(profile_name)
                    features_dict = use_smartopen_features(
                        features,
                        transport_params=transport_params,
                    )
                else:
                    with open(features, "r") as f:
                        features_dict = json.load(f)
            return cls(features_dict)
        return cls(json.load(features))

    def to_list(self):
        feature_names = self.features_dict["feature_list"]
        return [self._deserialize_feature(name) for name in feature_names]

    def _deserialize_feature(self, feature_name):
        if feature_name in self._deserialized_features:
            return self._deserialized_features[feature_name]

        feature_dict = self.features_dict["feature_definitions"][feature_name]
        dependencies_list = feature_dict["dependencies"]
        primitive = None
        primitive_id = feature_dict["arguments"].get("primitive")
        if primitive_id is not None:
            primitive = self._deserialized_primitives[primitive_id]

        # Collect dependencies into a dictionary of name -> feature.
        dependencies = {
            dependency: self._deserialize_feature(dependency)
            for dependency in dependencies_list
        }

        type = feature_dict["type"]
        cls = self.FEATURE_CLASSES.get(type)
        if not cls:
            raise RuntimeError('Unrecognized feature type "%s"' % type)

        args = feature_dict["arguments"]
        feature = cls.from_dictionary(args, self.entityset, dependencies, primitive)

        self._deserialized_features[feature_name] = feature
        return feature

    def _check_schema_version(self):
        check_schema_version(self, "features")


================================================
FILE: featuretools/feature_base/features_serializer.py
================================================
import json

from featuretools.primitives.utils import serialize_primitive
from featuretools.utils.s3_utils import get_transport_params, use_smartopen_features
from featuretools.utils.wrangle import _is_s3, _is_url
from featuretools.version import FEATURES_SCHEMA_VERSION
from featuretools.version import __version__ as ft_version


def save_features(features, location=None, profile_name=None):
    """Saves the features list as JSON to a specified filepath/S3 path, writes to an open file, or
    returns the serialized features as a JSON string. If no file provided, returns a string.

    Args:
        features (list[:class:`.FeatureBase`]): List of Feature definitions.

        location (str or :class:`.FileObject`, optional): Where to save the features list.
            This must be a path that includes the name of the file,
            or a writeable file handle to write to. If location is None, will return a JSON string
            of the serialized features.
            Default: None

        profile_name (str, bool): The AWS profile specified to write to S3. Will default to None and search for AWS credentials.
                                    Set to False to use an anonymous profile.

    Note:
        Features saved in one version of Featuretools are not guaranteed to work in another.
        After upgrading Featuretools, features may need to be generated again.

    Example:
        .. ipython:: python
            :suppress:

            from featuretools.tests.testing_utils import (
                make_ecommerce_entityset)
            import featuretools as ft
            es = make_ecommerce_entityset()
            import os

        .. code-block:: python

            f1 = ft.Feature(es["log"].ww["product_id"])
            f2 = ft.Feature(es["log"].ww["purchased"])
            f3 = ft.Feature(es["log"].ww["value"])

            features = [f1, f2, f3]

            # Option 1
            filepath = os.path.join('/Home/features/', 'list.json')
            ft.save_features(features, filepath)

            # Option 2
            filepath = os.path.join('/Home/features/', 'list.json')
            with open(filepath, 'w') as f:
                ft.save_features(features, f)

            # Option 3
            features_string = ft.save_features(features)

    .. seealso::
        :func:`.load_features`
    """
    return FeaturesSerializer(features).save(location, profile_name=profile_name)


class FeaturesSerializer(object):
    def __init__(self, feature_list):
        self.feature_list = feature_list
        self._features_dict = None

    def to_dict(self):
        names_list = [feat.unique_name() for feat in self.feature_list]
        es = self.feature_list[0].entityset

        feature_defs, primitive_defs = self._feature_definitions()

        return {
            "schema_version": FEATURES_SCHEMA_VERSION,
            "ft_version": ft_version,
            "entityset": es.to_dictionary(),
            "feature_list": names_list,
            "feature_definitions": feature_defs,
            "primitive_definitions": primitive_defs,
        }

    def save(self, location, profile_name):
        features_dict = self.to_dict()
        if location is None:
            return json.dumps(features_dict)
        if isinstance(location, str):
            if _is_url(location):
                raise ValueError("Writing to URLs is not supported")
            if _is_s3(location):
                transport_params = get_transport_params(profile_name)
                use_smartopen_features(
                    location,
                    features_dict,
                    transport_params,
                    read=False,
                )
            else:
                with open(location, "w") as f:
                    json.dump(features_dict, f)
        else:
            json.dump(features_dict, location)

    def _feature_definitions(self):
        if not self._features_dict:
            self._features_dict = {}
            self._primitives_dict = {}

            for feature in self.feature_list:
                self._serialize_feature(feature)

            primitive_number = 0
            primitive_id_to_key = {}
            for name, feature in self._features_dict.items():
                primitive = feature["arguments"].get("primitive")
                if primitive:
                    primitive_id = id(primitive)
                    if primitive_id not in primitive_id_to_key:
                        # Primitive we haven't seen before, add to dict and increment primitive_id counter
                        # Always use string for keys because json conversion results in integer dict keys
                        # being converted to strings, but integer dict values are not.
                        primitives_dict_key = str(primitive_number)
                        primitive_id_to_key[primitive_id] = primitives_dict_key
                        self._primitives_dict[primitives_dict_key] = (
                            serialize_primitive(primitive)
                        )
                        self._features_dict[name]["arguments"]["primitive"] = (
                            primitives_dict_key
                        )
                        primitive_number += 1
                    else:
                        # Primitive we have seen already - use existing primitive_id key
                        key = primitive_id_to_key[primitive_id]
                        self._features_dict[name]["arguments"]["primitive"] = key

        return self._features_dict, self._primitives_dict

    def _serialize_feature(self, feature):
        name = feature.unique_name()

        if name not in self._features_dict:
            self._features_dict[feature.unique_name()] = feature.to_dictionary()

            for dependency in feature.get_dependencies(deep=True):
                name = dependency.unique_name()
                if name not in self._features_dict:
                    self._features_dict[name] = dependency.to_dictionary()
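

# Illustrative sketch (not part of the library): the two ideas behind the
# primitive key assignment in _feature_definitions. First, a JSON round trip
# turns integer dict keys into strings but leaves integer values alone, which
# is why string keys are used from the start. Second, objects are deduplicated
# by identity so a shared instance gets one key. The helper name assign_keys
# is hypothetical.

```python
import json

# Integer keys come back as strings; integer values stay integers.
round_tripped = json.loads(json.dumps({0: "sum", "ref": 0}))


def assign_keys(objects):
    # Map each distinct object (by identity) to a sequential string key.
    id_to_key = {}
    keys = []
    for obj in objects:
        obj_id = id(obj)
        if obj_id not in id_to_key:
            id_to_key[obj_id] = str(len(id_to_key))
        keys.append(id_to_key[obj_id])
    return keys


shared = object()
other = object()
keys = assign_keys([shared, other, shared])
```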


================================================
FILE: featuretools/feature_base/utils.py
================================================
def is_valid_input(candidate, template):
    """Checks if a candidate schema should be considered a match for a template schema"""
    if template.logical_type is not None and not isinstance(
        candidate.logical_type,
        type(template.logical_type),
    ):
        return False
    if len(template.semantic_tags - candidate.semantic_tags):
        return False
    return True
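

# Illustrative sketch (not part of the library): the matching rule in
# is_valid_input, exercised with stand-in objects. Double, Integer and the
# SimpleNamespace schemas are hypothetical placeholders for Woodwork logical
# types and column schemas. A template matches when its logical type (if any)
# is satisfied and all of its semantic tags are present on the candidate.

```python
from types import SimpleNamespace


class Double:
    pass


class Integer:
    pass


def matches(candidate, template):
    # Same checks as is_valid_input above.
    if template.logical_type is not None and not isinstance(
        candidate.logical_type,
        type(template.logical_type),
    ):
        return False
    # Any template tag missing from the candidate is a mismatch.
    if len(template.semantic_tags - candidate.semantic_tags):
        return False
    return True


candidate = SimpleNamespace(logical_type=Double(), semantic_tags={"numeric", "index"})

any_numeric = SimpleNamespace(logical_type=None, semantic_tags={"numeric"})
integer_only = SimpleNamespace(logical_type=Integer(), semantic_tags=set())
categorical = SimpleNamespace(logical_type=None, semantic_tags={"category"})
```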


================================================
FILE: featuretools/feature_discovery/FeatureCollection.py
================================================
from __future__ import annotations

import hashlib
from itertools import combinations
from typing import Any, Dict, List, Optional, Set, Type, Union, cast

from woodwork.logical_types import LogicalType

from featuretools.feature_discovery.LiteFeature import LiteFeature
from featuretools.feature_discovery.type_defs import ANY
from featuretools.feature_discovery.utils import hash_primitive, logical_types_map
from featuretools.primitives.base.primitive_base import PrimitiveBase
from featuretools.primitives.utils import (
    PrimitivesDeserializer,
)


class FeatureCollection:
    def __init__(self, features: List[LiteFeature]):
        self._all_features: List[LiteFeature] = features
        self.indexed = False
        self.sorted = False
        self._hash_key: Optional[str] = None

    def sort_features(self):
        if not self.sorted:
            self._all_features = sorted(self._all_features)
            self.sorted = True

    def __repr__(self):
        return f"<FeatureCollection ({self.hash_key[:5]}) n_features={len(self._all_features)} indexed={self.indexed}>"

    @property
    def all_features(self):
        return self._all_features.copy()

    @property
    def hash_key(self) -> str:
        if self._hash_key is None:
            if not self.sorted:
                self.sort_features()
            self._set_hash()
        assert self._hash_key is not None
        return self._hash_key

    def _set_hash(self):
        hash_msg = hashlib.sha256()

        for feature in self._all_features:
            hash_msg.update(feature.id.encode("utf-8"))

        self._hash_key = hash_msg.hexdigest()
        return self

    def __hash__(self):
        return hash(self.hash_key)

    def __eq__(self, other: FeatureCollection) -> bool:
        return self.hash_key == other.hash_key

    def reindex(self) -> FeatureCollection:
        self.by_logical_type: Dict[
            Union[Type[LogicalType], None],
            Set[LiteFeature],
        ] = {}
        self.by_tag: Dict[str, Set[LiteFeature]] = {}
        self.by_origin_feature: Dict[LiteFeature, Set[LiteFeature]] = {}
        self.by_depth: Dict[int, Set[LiteFeature]] = {}
        self.by_name: Dict[str, LiteFeature] = {}
        self.by_key: Dict[str, List[LiteFeature]] = {}

        for feature in self._all_features:
            for key in self.feature_to_keys(feature):
                self.by_key.setdefault(key, []).append(feature)

            logical_type = feature.logical_type
            self.by_logical_type.setdefault(logical_type, set()).add(feature)

            tags = feature.tags
            for tag in tags:
                self.by_tag.setdefault(tag, set()).add(feature)

            origin_features = feature.get_origin_features()
            for origin_feature in origin_features:
                self.by_origin_feature.setdefault(origin_feature, set()).add(feature)

            if feature.depth == 0:
                self.by_origin_feature.setdefault(feature, set()).add(feature)

            feature_name = feature.name
            assert feature_name is not None
            assert feature_name not in self.by_name

            self.by_name[feature_name] = feature

        self.indexed = True

        return self

    def get_by_logical_type(self, logical_type: Type[LogicalType]) -> Set[LiteFeature]:
        return self.by_logical_type.get(logical_type, set())

    def get_by_tag(self, tag: str) -> Set[LiteFeature]:
        return self.by_tag.get(tag, set())

    def get_by_origin_feature(self, origin_feature: LiteFeature) -> Set[LiteFeature]:
        return self.by_origin_feature.get(origin_feature, set())

    def get_by_origin_feature_name(self, name: str) -> Union[LiteFeature, None]:
        feature = self.by_name.get(name)
        return feature

    def get_dependencies_by_origin_name(self, name) -> Set[LiteFeature]:
        origin_feature = self.by_name.get(name)
        if origin_feature:
            return self.by_origin_feature[origin_feature]
        return set()

    def get_by_key(self, key: str) -> List[LiteFeature]:
        return self.by_key.get(key, [])

    def flatten_features(self) -> Dict[str, LiteFeature]:
        all_features_dict: Dict[str, LiteFeature] = {}

        def rfunc(feature_list: List[LiteFeature]):
            for feature in feature_list:
                all_features_dict.setdefault(feature.id, feature)
                rfunc(feature.base_features)

        rfunc(self._all_features)
        return all_features_dict

    def flatten_primitives(self) -> Dict[str, Dict[str, Any]]:
        all_primitives_dict: Dict[str, Dict[str, Any]] = {}

        def rfunc(feature_list: List[LiteFeature]):
            for feature in feature_list:
                if feature.primitive:
                    key, prim_dict = hash_primitive(feature.primitive)
                    all_primitives_dict.setdefault(key, prim_dict)
                rfunc(feature.base_features)

        rfunc(self._all_features)
        return all_primitives_dict

    def to_dict(self):
        all_primitives_dict = self.flatten_primitives()
        all_features_dict = self.flatten_features()

        return {
            "primitives": all_primitives_dict,
            "feature_ids": [f.id for f in self._all_features],
            "all_features": {k: f.to_dict() for k, f in all_features_dict.items()},
        }

    @staticmethod
    def feature_to_keys(feature: LiteFeature) -> List[str]:
        """
        Generate hashing keys from a LiteFeature. For example:
        - LiteFeature("f1", Double, {"numeric"}) -> ['Double', 'numeric', 'Double,numeric', 'ANY']
        - LiteFeature("f1", Datetime, {"time_index"}) -> ['Datetime', 'time_index', 'Datetime,time_index', 'ANY']
        - LiteFeature("f1", Double, {"index", "other"}) -> ['Double', 'index', 'Double,index', 'other', 'Double,other',
          'index,other', 'Double,index,other', 'ANY']

        Args:
            feature (LiteFeature):
                Feature to generate keys for

        Returns:
            List[str]
                List of hashing keys
        """
        keys: List[str] = []
        logical_type = feature.logical_type
        logical_type_name = None
        if logical_type is not None:
            logical_type_name = logical_type.__name__
            keys.append(logical_type_name)

        all_tags = sorted(feature.tags)

        tag_combinations = []

        # generate combinations of all lengths from 1 to the length of the input list
        for i in range(1, len(all_tags) + 1):
            # generate combinations of length i and append each one to tag_combinations
            for comb in combinations(all_tags, i):
                tag_combinations.append(list(comb))

        for tag_combination in tag_combinations:
            tags_key = ",".join(tag_combination)
            keys.append(tags_key)
            if logical_type_name:
                keys.append(f"{logical_type_name},{tags_key}")

        keys.append(ANY)
        return keys

    @staticmethod
    def from_dict(input_dict):
        primitive_deserializer = PrimitivesDeserializer()

        primitives = {}
        for prim_key, prim_dict in input_dict["primitives"].items():
            primitive = primitive_deserializer.deserialize_primitive(
                prim_dict,
            )
            assert isinstance(primitive, PrimitiveBase)
            primitives[prim_key] = primitive

        hydrated_features: Dict[str, LiteFeature] = {}

        feature_ids: List[str] = cast(List[str], input_dict["feature_ids"])
        all_features: Dict[str, Any] = cast(Dict[str, Any], input_dict["all_features"])

        def hydrate_feature(feature_id: str) -> LiteFeature:
            if feature_id in hydrated_features:
                return hydrated_features[feature_id]

            feature_dict = all_features[feature_id]
            base_features = [hydrate_feature(x) for x in feature_dict["base_features"]]

            logical_type = (
                logical_types_map[feature_dict["logical_type"]]
                if feature_dict["logical_type"]
                else None
            )

            hydrated_feature = LiteFeature(
                name=feature_dict["name"],
                logical_type=logical_type,
                tags=set(feature_dict["tags"]),
                primitive=primitives[feature_dict["primitive"]]
                if feature_dict["primitive"]
                else None,
                base_features=base_features,
                df_id=feature_dict["df_id"],
                related_features=set(),
                idx=feature_dict["idx"],
            )

            assert hydrated_feature.id == feature_dict["id"] == feature_id
            hydrated_features[feature_id] = hydrated_feature

            # need to link after features are stored on cache
            related_features = [
                hydrate_feature(x) for x in feature_dict["related_features"]
            ]
            hydrated_feature.related_features = set(related_features)

            return hydrated_feature

        return FeatureCollection([hydrate_feature(x) for x in feature_ids])
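

# The key scheme documented in FeatureCollection.feature_to_keys can be sketched
# with the standard library alone. This is an illustration, not the library's
# code path: `sketch_feature_keys` is a hypothetical helper, and the literal
# "ANY" is assumed to match the ANY constant imported from type_defs.

```python
from itertools import combinations
from typing import List, Optional, Set


def sketch_feature_keys(logical_type_name: Optional[str], tags: Set[str]) -> List[str]:
    """Build lookup keys from a logical type name and a set of semantic tags."""
    keys: List[str] = []
    if logical_type_name is not None:
        keys.append(logical_type_name)

    # sort the tags so that the generated keys are deterministic
    ordered_tags = sorted(tags)

    # every non-empty tag combination yields a tag-only key and,
    # when a logical type is present, a "type,tags" key as well
    for size in range(1, len(ordered_tags) + 1):
        for combo in combinations(ordered_tags, size):
            tags_key = ",".join(combo)
            keys.append(tags_key)
            if logical_type_name:
                keys.append(f"{logical_type_name},{tags_key}")

    # assumed wildcard key that matches every feature
    keys.append("ANY")
    return keys
```

# Because every non-empty subset of tags produces keys, the number of keys grows
# exponentially with the number of tags on a feature.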


================================================
FILE: featuretools/feature_discovery/LiteFeature.py
================================================
from __future__ import annotations

import hashlib
from dataclasses import field
from functools import total_ordering
from typing import Any, Dict, List, Optional, Set, Type, Union

from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import LogicalType

from featuretools.feature_discovery.utils import (
    get_primitive_return_type,
    hash_primitive,
)
from featuretools.primitives.base.primitive_base import PrimitiveBase


@total_ordering
class LiteFeature:
    _name: Optional[str] = None
    _alias: Optional[str] = None

    _logical_type: Optional[Type[LogicalType]] = None
    _tags: Set[str] = field(default_factory=set)
    _primitive: Optional[PrimitiveBase] = None
    _base_features: List[LiteFeature] = field(default_factory=list)
    _df_id: Optional[str] = None

    _id: str
    _n_output_features: int = 1

    _depth = 0
    _related_features: Set[LiteFeature]
    _idx: int = 0

    def __init__(
        self,
        name: Optional[str] = None,
        logical_type: Optional[Type[LogicalType]] = None,
        tags: Optional[Set[str]] = None,
        primitive: Optional[PrimitiveBase] = None,
        base_features: Optional[List[LiteFeature]] = None,
        df_id: Optional[str] = None,
        related_features: Optional[Set[LiteFeature]] = None,
        idx: Optional[int] = None,
    ):
        self._logical_type = logical_type
        self._tags = tags if tags else set()
        self._primitive = primitive
        self._base_features = base_features if base_features else []
        self._df_id = df_id
        self._idx = idx if idx is not None else 0
        self._related_features = related_features if related_features else set()

        if self._primitive:
            if not isinstance(self._primitive, PrimitiveBase):
                raise ValueError("primitive input must be of type PrimitiveBase")

            if len(self.base_features) == 0:
                raise ValueError("there must be base features if given a primitive")

            if self._primitive.commutative:
                self._base_features = sorted(self._base_features)

            self._n_output_features = self._primitive.number_output_features
            self._depth = max([x.depth for x in self.base_features]) + 1

            if name:
                self._alias = name

            self._name = self._primitive.generate_name(
                [x.name for x in self.base_features],
            )

            return_column_schema = get_primitive_return_type(self._primitive)
            self._logical_type = (
                type(return_column_schema.logical_type)
                if return_column_schema.logical_type
                else None
            )

            self._tags = return_column_schema.semantic_tags

        else:
            if name is None:
                raise TypeError("Name must be given if origin feature")

            if self._logical_type is None:
                raise TypeError("Logical Type must be given if origin feature")

            self._name = name

        if self._logical_type is not None and "index" not in self._tags:
            self._tags = self._tags | self._logical_type.standard_tags

        self._id = self._generate_hash()

    @property
    def name(self):
        if self._alias:
            return self._alias
        elif self.is_multioutput():
            return f"{self._name}[{self.idx}]"
        return self._name

    @name.setter
    def name(self, _):
        raise AttributeError("name is immutable")

    def set_alias(self, value: Union[str, None]):
        self._alias = value

    @property
    def non_indexed_name(self):
        if not self.is_multioutput():
            raise ValueError("only used on multioutput features")
        return self._name

    @property
    def logical_type(self):
        return self._logical_type

    @logical_type.setter
    def logical_type(self, _):
        raise AttributeError("logical_type is immutable")

    @property
    def tags(self):
        return self._tags.copy()

    @tags.setter
    def tags(self, _):
        raise AttributeError("tags is immutable")

    @property
    def primitive(self):
        return self._primitive

    @primitive.setter
    def primitive(self, _):
        raise AttributeError("primitive is immutable")

    @property
    def base_features(self):
        return self._base_features

    @base_features.setter
    def base_features(self, _):
        raise AttributeError("base_features are immutable")

    @property
    def df_id(self):
        return self._df_id

    @df_id.setter
    def df_id(self, _):
        raise AttributeError("df_id is immutable")

    @property
    def id(self):
        return self._id

    @id.setter
    def id(self, _):
        raise AttributeError("id is immutable")

    @property
    def n_output_features(self):
        return self._n_output_features

    @n_output_features.setter
    def n_output_features(self, _):
        raise AttributeError("n_output_features is immutable")

    @property
    def depth(self):
        return self._depth

    @depth.setter
    def depth(self, _):
        raise AttributeError("depth is immutable")

    @property
    def related_features(self):
        return self._related_features.copy()

    @related_features.setter
    def related_features(self, value: Set[LiteFeature]):
        self._related_features = value

    @property
    def idx(self):
        return self._idx

    @idx.setter
    def idx(self, _):
        raise AttributeError("idx is immutable")

    @staticmethod
    def hash(
        name: Optional[str],
        primitive: Optional[PrimitiveBase] = None,
        base_features: List[LiteFeature] = [],
        df_id: Optional[str] = None,
        idx: int = 0,
    ):
        hash_msg = hashlib.sha256()

        if primitive:
            # TODO: hashing should be on primitive
            hash_msg.update(hash_primitive(primitive)[0].encode("utf-8"))
            commutative = primitive.commutative
            assert (
                len(base_features) > 0
            ), "there must be base features if given a primitive"
            # sort a copy so the caller's list (or the shared default) is never mutated
            base_columns = sorted(base_features) if commutative else base_features

            for c in base_columns:
                hash_msg.update(c.id.encode("utf-8"))

        else:
            assert name
            hash_msg.update(name.encode("utf-8"))
            if df_id:
                hash_msg.update(df_id.encode("utf-8"))

        hash_msg.update(str(idx).encode("utf-8"))

        return hash_msg.hexdigest()

    def __eq__(self, other: LiteFeature):
        return self._id == other._id

    def __lt__(self, other: LiteFeature):
        return self._id < other._id

    def __ne__(self, other):
        return self._id != other._id

    def __hash__(self):
        return hash(self._id)

    def _generate_hash(self) -> str:
        return self.hash(
            name=self._name,
            primitive=self._primitive,
            base_features=self._base_features,
            df_id=self._df_id,
            idx=self._idx,
        )

    def get_primitive_name(self) -> Union[str, None]:
        return self._primitive.name if self._primitive else None

    def get_dependencies(self, deep=False) -> List[LiteFeature]:
        flattened_dependencies = []
        for f in self._base_features:
            flattened_dependencies.append(f)

            if deep:
                # recurse with deep=True so dependencies at every depth are collected
                flattened_dependencies.extend(f.get_dependencies(deep=True))
        return flattened_dependencies

    def get_origin_features(self) -> List[LiteFeature]:
        all_dependencies = self.get_dependencies(deep=True)
        return [f for f in all_dependencies if f._depth == 0]

    @property
    def column_schema(self) -> ColumnSchema:
        return ColumnSchema(logical_type=self.logical_type, semantic_tags=self.tags)

    def dependent_primitives(self) -> Set[Type[PrimitiveBase]]:
        dependent_features = self.get_dependencies(deep=True)
        dependent_primitives = {
            type(f._primitive) for f in dependent_features if f._primitive
        }
        if self._primitive:
            dependent_primitives.add(type(self._primitive))
        return dependent_primitives

    def to_dict(self) -> Dict[str, Any]:
        return {
            "name": self.name,
            "logical_type": self.logical_type.__name__ if self.logical_type else None,
            "tags": list(self.tags),
            "primitive": hash_primitive(self.primitive)[0] if self.primitive else None,
            "base_features": [x.id for x in self.base_features],
            "df_id": self.df_id,
            "id": self.id,
            "related_features": [x.id for x in self.related_features],
            "idx": self.idx,
        }

    def is_multioutput(self) -> bool:
        return len(self._related_features) > 0

    def copy(self) -> LiteFeature:
        copied_feature = LiteFeature(
            name=self._name,
            logical_type=self._logical_type,
            tags=self._tags.copy(),
            primitive=self._primitive,
            base_features=[f.copy() for f in self._base_features],
            df_id=self._df_id,
            idx=self._idx,
            related_features=self._related_features.copy(),
        )

        copied_feature.set_alias(self._alias)

        return copied_feature

    def __repr__(self) -> str:
        name = f"name='{self.name}'"
        logical_type = f"logical_type={self.logical_type}"
        tags = f"tags={self.tags}"
        primitive = f"primitive={self.get_primitive_name()}"
        return f"LiteFeature({name}, {logical_type}, {tags}, {primitive})"
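

# LiteFeature.hash gives features built from a commutative primitive the same id
# regardless of input order. A minimal sketch of that idea, using plain strings
# in place of primitives and features; `sketch_feature_id` and its parameters
# are hypothetical names chosen for illustration.

```python
import hashlib
from typing import Sequence


def sketch_feature_id(
    primitive_key: str,
    base_ids: Sequence[str],
    commutative: bool,
    idx: int = 0,
) -> str:
    """Hash a primitive key plus its input ids into a stable identifier."""
    # sort a copy for commutative primitives so input order does not matter,
    # and so the caller's sequence is left untouched
    ordered_ids = sorted(base_ids) if commutative else list(base_ids)

    digest = hashlib.sha256()
    digest.update(primitive_key.encode("utf-8"))
    for base_id in ordered_ids:
        digest.update(base_id.encode("utf-8"))

    # the output index separates the outputs of a multi-output primitive
    digest.update(str(idx).encode("utf-8"))
    return digest.hexdigest()
```

# Sorting a copy rather than sorting in place keeps the hash function free of
# side effects on its arguments.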


================================================
FILE: featuretools/feature_discovery/__init__.py
================================================


================================================
FILE: featuretools/feature_discovery/convertors.py
================================================
from __future__ import annotations

from typing import Dict, List

import pandas as pd
from woodwork.logical_types import LogicalType

from featuretools.feature_base.feature_base import (
    FeatureBase,
    IdentityFeature,
    TransformFeature,
)
from featuretools.feature_discovery.LiteFeature import LiteFeature
from featuretools.primitives import TransformPrimitive
from featuretools.primitives.base.primitive_base import PrimitiveBase

FeatureCache = Dict[str, FeatureBase]


def convert_featurebase_list_to_feature_list(
    featurebase_list: List[FeatureBase],
) -> List[LiteFeature]:
    """
    Convert a list of FeatureBase objects to a list of LiteFeature objects

    Args:
        featurebase_list (List[FeatureBase]):
            FeatureBase objects to convert

    Returns:
        List[LiteFeature]
            Converted LiteFeature objects
    """

    def rfunc(fb: FeatureBase) -> List[LiteFeature]:
        base_features = [
            feature
            for feature_list in [rfunc(x) for x in fb.base_features]
            for feature in feature_list
        ]
        col_schema = fb.column_schema

        logical_type = col_schema.logical_type
        if logical_type is not None:
            assert issubclass(type(logical_type), LogicalType)
            logical_type = type(logical_type)

        tags = col_schema.semantic_tags

        if isinstance(fb, IdentityFeature):
            primitive = None
        else:
            primitive = fb.primitive
            assert isinstance(primitive, PrimitiveBase)

        if fb.number_output_features > 1:
            features: List[LiteFeature] = []

            for idx, name in enumerate(fb.get_feature_names()):
                f = LiteFeature(
                    name=name,
                    logical_type=logical_type,
                    tags=tags,
                    primitive=primitive,
                    base_features=base_features,
                    # TODO: use when working with multi-table
                    df_id=None,
                    idx=idx,
                )
                features.append(f)

            for feature in features:
                related_features = [f for f in features if f.id != feature.id]
                feature.related_features = set(related_features)

            return features

        return [
            LiteFeature(
                name=fb.get_name(),
                logical_type=logical_type,
                tags=tags,
                primitive=primitive,
                base_features=base_features,
                # TODO: use when working with multi-table
                df_id=None,
            ),
        ]

    return [
        feature
        for feature_list in [rfunc(fb) for fb in featurebase_list]
        for feature in feature_list
    ]


def _feature_to_transform_feature(
    feature: LiteFeature,
    base_features: List[FeatureBase],
) -> FeatureBase:
    """
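    Transform a LiteFeature into a FeatureBase object. For a multi-output
    feature, the FeatureBase is renamed to the non-indexed name and its output
    names are taken from the feature and its related features, ordered by idx.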

    Args:
        feature (LiteFeature)
        base_features (List[FeatureBase])

    Returns:
       FeatureBase
    """
    assert feature.primitive

    assert isinstance(
        feature.primitive,
        TransformPrimitive,
    ), "Only Transform Primitives"

    fb = TransformFeature(base_features, feature.primitive)
    if feature.is_multioutput():
        sorted_features = sorted(
            [f for f in feature.related_features] + [feature],
            key=lambda x: x.idx,
        )
        names = [x.name for x in sorted_features]

        fb = fb.rename(feature.non_indexed_name)
        fb.set_feature_names(names)
    else:
        fb = fb.rename(feature.name)

    return fb


def _convert_feature_to_featurebase(
    feature: LiteFeature,
    dataframe: pd.DataFrame,
    cache: FeatureCache,
) -> FeatureBase:
    """
    Recursively transforms a LiteFeature object into a FeatureBase object

    Args:
        feature (LiteFeature)
        dataframe (pd.DataFrame) Woodwork-initialized dataframe holding the origin columns
        cache (FeatureCache) already converted features, keyed by feature id

    Returns:
       FeatureBase
    """

    def get_base_features(
        feature: LiteFeature,
    ) -> List[FeatureBase]:
        new_base_features: List[FeatureBase] = []
        for bf in feature.base_features:
            fb = rfunc(bf)
            if bf.is_multioutput():
                idx = bf.idx
                # if its multioutput, you can index on the FeatureBase
                new_base_features.append(fb[idx])
            else:
                new_base_features.append(fb)

        return new_base_features

    def rfunc(feature: LiteFeature) -> FeatureBase:
        # if feature has already been converted, return from cache
        if feature.id in cache:
            return cache[feature.id]

        # if depth is 0, we are at an origin feature
        if feature.depth == 0:
            fb = IdentityFeature(dataframe.ww[feature.name])
            cache[feature.id] = fb
            return fb

        base_features = get_base_features(feature)

        fb = _feature_to_transform_feature(feature, base_features)
        cache[feature.id] = fb
        return fb

    return rfunc(feature)


def convert_feature_list_to_featurebase_list(
    feature_list: List[LiteFeature],
    dataframe: pd.DataFrame,
) -> List[FeatureBase]:
    """
    Convert a list of LiteFeature objects into a list of FeatureBase objects

    Args:
        feature_list (List[LiteFeature])
        dataframe (pd.DataFrame)

    Returns:
       List[FeatureBase]
    """
    feature_cache: FeatureCache = {}

    converted_features: List[FeatureBase] = []
    for feature in feature_list:
        if feature.is_multioutput():
            related_feature_ids = [f.id for f in feature.related_features]
            if any((x in feature_cache for x in related_feature_ids)):
                # feature base already created for related ids
                continue

        fb = _convert_feature_to_featurebase(
            feature=feature,
            dataframe=dataframe,
            cache=feature_cache,
        )
        converted_features.append(fb)

    return converted_features
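

# The cache passed through _convert_feature_to_featurebase ensures that a shared
# base feature is converted only once. A minimal sketch of that memoized
# recursion, with a hypothetical dict-of-ids graph standing in for LiteFeature
# objects and strings standing in for FeatureBase objects.

```python
from typing import Dict, List

# Hypothetical stand-in for a feature graph: feature id -> ids of its base features.
Graph = Dict[str, List[str]]


def sketch_convert(
    feature_id: str,
    graph: Graph,
    cache: Dict[str, str],
    build_log: List[str],
) -> str:
    """Convert a feature and its dependencies, reusing already converted ones."""
    # if the feature has already been converted, return it from the cache
    if feature_id in cache:
        return cache[feature_id]

    # convert the base features first (origin features have none)
    bases = [sketch_convert(b, graph, cache, build_log) for b in graph[feature_id]]
    converted = f"{feature_id}({', '.join(bases)})" if bases else feature_id

    # record the build and store the result for later lookups
    build_log.append(feature_id)
    cache[feature_id] = converted
    return converted
```

# With a graph where two features share the same base feature, the shared base
# appears in the build log only once.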


================================================
FILE: featuretools/feature_discovery/feature_discovery.py
================================================
import inspect
from collections import defaultdict
from itertools import combinations, permutations, product
from typing import Iterable, List, Set, Tuple, Type, Union, cast

from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import LogicalType
from woodwork.table_schema import TableSchema

from featuretools.feature_discovery.FeatureCollection import FeatureCollection
from featuretools.feature_discovery.LiteFeature import LiteFeature
from featuretools.feature_discovery.utils import column_schema_to_keys, flatten_list
from featuretools.primitives.base.primitive_base import PrimitiveBase


def _index_column_set(column_set: List[ColumnSchema]) -> List[Tuple[str, int]]:
    """
    Indexes input set to find types of columns and the quantity of each

    Args:
        column_set (List[ColumnSchema]):
            List of Column types needed by associated primitive.

    Returns:
        List[Tuple[str, int]]
            A list of key, count tuples

    Examples:
        .. code-block:: python

            from featuretools.feature_discovery.feature_discovery import _index_column_set
            from woodwork.column_schema import ColumnSchema

            column_set = [ColumnSchema(semantic_tags={"numeric"}), ColumnSchema(semantic_tags={"numeric"})]
            indexed_column_set = _index_column_set(column_set)
            # [("numeric", 2)]
    """
    out = defaultdict(int)
    for column_schema in column_set:
        key = column_schema_to_keys(column_schema)
        out[key] += 1
    return list(out.items())


def _get_features(
    feature_collection: FeatureCollection,
    column_keys: Tuple[Tuple[str, int], ...],
    commutative: bool,
) -> List[List[LiteFeature]]:
    """
    Calculates all LiteFeature combinations using the given indexed feature collection and the input set of required column keys.

    Args:
        feature_collection (FeatureCollection):
            An indexed feature collection object for efficient querying of features
        column_keys (Tuple[Tuple[str, int], ...]):
            Key, count tuples describing the column types needed by the associated primitive.
        commutative (bool):
            If True, input order does not matter and combinations are used; otherwise permutations are used.

    Returns:
        List[List[LiteFeature]]
            A list of LiteFeature sets.

    Examples:
        .. code-block:: python

            from featuretools.feature_discovery.feature_discovery import _get_features
            from featuretools.feature_discovery.FeatureCollection import FeatureCollection
            from featuretools.feature_discovery.LiteFeature import LiteFeature
            from woodwork.logical_types import Double

            origin_features = [LiteFeature(name, Double) for name in ["f1", "f2", "f3"]]
            feature_collection = FeatureCollection(origin_features).reindex()

            column_keys = (("numeric", 2),)
            features = _get_features(feature_collection, column_keys, commutative=False)
    """

    prod_iter = []
    for key, count in column_keys:
        relevant_features = list(feature_collection.get_by_key(key))

        if commutative:
            prod_iter.append(combinations(relevant_features, count))
        else:
            prod_iter.append(permutations(relevant_features, count))

    feature_combinations = product(*prod_iter)

    return [flatten_list(x) for x in feature_combinations]
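

# The selection logic in _get_features can be sketched with plain strings:
# combinations when input order is irrelevant, permutations otherwise, and a
# cartesian product across the distinct keys. `sketch_candidate_sets` and the
# `by_key` dict are hypothetical stand-ins for the function above and for
# FeatureCollection.get_by_key.

```python
from itertools import combinations, permutations, product
from typing import Dict, List, Sequence, Tuple


def sketch_candidate_sets(
    by_key: Dict[str, List[str]],
    column_keys: Sequence[Tuple[str, int]],
    commutative: bool,
) -> List[List[str]]:
    """Enumerate candidate input sets for one primitive signature."""
    # commutative primitives only need unordered selections
    picker = combinations if commutative else permutations

    per_key = []
    for key, count in column_keys:
        candidates = by_key.get(key, [])
        per_key.append(picker(candidates, count))

    # take one selection per key, then flatten each choice into a single list
    return [
        [name for group in choice for name in group]
        for choice in product(*per_key)
    ]
```

# For three numeric candidates and a signature that needs two numeric inputs,
# the commutative case yields 3 sets while the non-commutative case yields 6.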


def _primitive_to_columnsets(primitive: PrimitiveBase) -> List[List[ColumnSchema]]:
    column_sets = primitive.input_types
    assert column_sets is not None
    if not isinstance(column_sets[0], list):
        column_sets = [primitive.input_types]

    column_sets = cast(List[List[ColumnSchema]], column_sets)

    # Some primitives are commutative, yet explicitly list both orderings of a commutative pair (e.g. MultiplyNumericBoolean),
    # which would create duplicate features, so the column sets are deduplicated here.
    if primitive.commutative:
        existing = set()
        uniq_column_sets = []
        for column_set in column_sets:
            key = "_".join(sorted([x.__repr__() for x in column_set]))
            if key not in existing:
                uniq_column_sets.append(column_set)
                existing.add(key)

        column_sets = uniq_column_sets

    return column_sets


def _get_matching_features(
    feature_collection: FeatureCollection,
    primitive: PrimitiveBase,
) -> List[List[LiteFeature]]:
    """
    For a given primitive, find all feature sets that can be used to create a new feature

    Args:
        feature_collection (FeatureCollection):
            An indexed feature collection object for efficient querying of features
        primitive (PrimitiveBase):
            Primitive instance to find matching features for

    Returns:
        List[List[LiteFeature]]
            List of feature sets

    Examples:
        .. code-block:: python

            from featuretools.feature_discovery.feature_discovery import _get_matching_features
            from featuretools.feature_discovery.FeatureCollection import FeatureCollection
            from featuretools.feature_discovery.LiteFeature import LiteFeature
            from featuretools.primitives import AddNumeric
            from woodwork.logical_types import Double

            f1, f2, f3 = [LiteFeature(name, Double) for name in ["f1", "f2", "f3"]]
            feature_collection = FeatureCollection([f1, f2, f3]).reindex()

            feature_sets = _get_matching_features(feature_collection, AddNumeric())

            # AddNumeric is commutative, so each unordered pair appears once:
            # [[f1, f2], [f1, f3], [f2, f3]]
    """
    column_sets = _primitive_to_columnsets(primitive=primitive)

    column_keys_set = [_index_column_set(c) for c in column_sets]

    commutative = primitive.commutative

    feature_sets = []
    for column_keys in column_keys_set:
        assert column_keys is not None
        feature_sets_ = _get_features(
            feature_collection=feature_collection,
            column_keys=tuple(column_keys),
            commutative=commutative,
        )

        feature_sets.extend(feature_sets_)

    return feature_sets


def _features_from_primitive(
    primitive: PrimitiveBase,
    feature_collection: FeatureCollection,
) -> List[LiteFeature]:
    """
    For a given primitive, creates all engineered features

    Args:
        primitive (PrimitiveBase):
            Primitive instance used to generate features
        feature_collection (FeatureCollection):
            An indexed feature collection object for efficient querying of features

    Returns:
        List[LiteFeature]
            List of engineered features, one per matching feature set
            (and one per output for multi-output primitives)

    Examples:
        .. code-block:: python

            from featuretools.feature_discovery.feature_discovery import _features_from_primitive
            from featuretools.feature_discovery.FeatureCollection import FeatureCollection
            from featuretools.feature_discovery.LiteFeature import LiteFeature
            from featuretools.primitives import AddNumeric
            from woodwork.logical_types import Double

            f1, f2, f3 = [LiteFeature(name, Double) for name in ["f1", "f2", "f3"]]
            feature_collection = FeatureCollection([f1, f2, f3]).reindex()

            features = _features_from_primitive(AddNumeric(), feature_collection)

            # three engineered features, one per unordered pair of inputs
    """
    assert isinstance(primitive, PrimitiveBase)

    features: List[LiteFeature] = []
    feature_sets = _get_matching_features(
        feature_collection=feature_collection,
        primitive=primitive,
    )
    for feature_set in feature_sets:
        if primitive.number_output_features > 1:
            related_features: Set[LiteFeature] = set()
            for n in range(primitive.number_output_features):
                feature = LiteFeature(
                    primitive=primitive,
                    base_features=feature_set,
                    idx=n,
                )

                related_features.add(feature)

            for f in related_features:
                f.related_features = related_features - {f}
                features.append(f)
        else:
            features.append(
                LiteFeature(
                    primitive=primitive,
                    base_features=feature_set,
                ),
            )
    return features


def schema_to_features(schema: TableSchema) -> List[LiteFeature]:
    """
    ** EXPERIMENTAL **
    Convert a Woodwork Schema object to a list of LiteFeatures.

    Args:
        schema (TableSchema):
            Woodwork TableSchema object

    Returns:
        List[LiteFeature]

    Examples:
        .. code-block:: python

            from featuretools.feature_discovery.feature_discovery import schema_to_features
            import pandas as pd
            import woodwork as ww

            df = pd.DataFrame({
                "idx": [0,1,2,3],
                "f1": ["A", "B", "C", "D"],
                "f2": [1.2, 2.3, 3.4, 4.5]
            })

            df.ww.init()

            features = schema_to_features(df.ww.schema)

    """
    features = []
    for col_name, column_schema in schema.columns.items():
        assert isinstance(column_schema, ColumnSchema)

        logical_type = column_schema.logical_type
        assert logical_type
        assert issubclass(type(logical_type), LogicalType)

        tags = column_schema.semantic_tags
        assert isinstance(tags, set)

        features.append(
            LiteFeature(
                name=col_name,
                logical_type=type(logical_type),
                tags=tags,
            ),
        )

    return features


def _check_inputs(
    input_features: Iterable[LiteFeature],
    primitives: Union[List[Type[PrimitiveBase]], List[PrimitiveBase]],
) -> Tuple[Iterable[LiteFeature], List[PrimitiveBase]]:
    if not isinstance(input_features, Iterable):
        raise ValueError("input_features must be an iterable of LiteFeature objects")

    for feature in input_features:
        if not isinstance(feature, LiteFeature):
            raise ValueError(
                "input_features must be an iterable of LiteFeature objects",
            )

    if not isinstance(primitives, List):
        raise ValueError(
            "primitives must be a list of Primitive classes or Primitive instances",
        )

    primitive_instances: List[PrimitiveBase] = []
    for primitive in primitives:
        if inspect.isclass(primitive) and issubclass(primitive, PrimitiveBase):
            primitive_instances.append(primitive())
        elif isinstance(primitive, PrimitiveBase):
            primitive_instances.append(primitive)
        else:
            raise ValueError(
                "primitives must be a list of Primitive classes or Primitive instances",
            )

    return (input_features, primitive_instances)
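

# The class-or-instance normalization performed by `_check_inputs` can be
# sketched without featuretools. `SketchPrimitive`, `Doubler`, and
# `normalize` are hypothetical names used only for this illustration.

```python
import inspect
from typing import List


class SketchPrimitive:
    """Hypothetical stand-in for PrimitiveBase."""

    name = "sketch"


class Doubler(SketchPrimitive):
    name = "doubler"


def normalize(primitives: list) -> List[SketchPrimitive]:
    message = "primitives must be a list of Primitive classes or Primitive instances"
    if not isinstance(primitives, list):
        raise ValueError(message)

    instances: List[SketchPrimitive] = []
    for primitive in primitives:
        if inspect.isclass(primitive) and issubclass(primitive, SketchPrimitive):
            # A class is instantiated with its default arguments.
            instances.append(primitive())
        elif isinstance(primitive, SketchPrimitive):
            # An instance is kept as-is, preserving any custom arguments.
            instances.append(primitive)
        else:
            raise ValueError(message)
    return instances


normalized = normalize([Doubler, Doubler()])
```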


def generate_features_from_primitives(
    input_features: Iterable[LiteFeature],
    primitives: Union[List[Type[PrimitiveBase]], List[PrimitiveBase]],
) -> List[LiteFeature]:
    """
    ** EXPERIMENTAL **
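    Calculates all features for a given iterable of input features and a list of primitives.
    The returned list contains copies of the input features followed by the engineered features.

    Args:
        input_features (Iterable[LiteFeature]):
            Iterable of origin features
        primitives (Union[List[Type[PrimitiveBase]], List[PrimitiveBase]]):
            List of primitive classes or primitive instances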

    Returns:
        List[LiteFeature]

    Examples:
        .. code-block:: python

            from featuretools.feature_discovery.feature_discovery import (
                generate_features_from_primitives,
                schema_to_features,
            )
            from featuretools.primitives import Absolute, IsNull
            import pandas as pd
            import woodwork as ww

            df = pd.DataFrame({
                "idx": [0, 1, 2, 3],
                "f1": ["A", "B", "C", "D"],
                "f2": [1.2, 2.3, 3.4, 4.5],
            })

            df.ww.init()
            origin_features = schema_to_features(df.ww.schema)
            features = generate_features_from_primitives(origin_features, [Absolute, IsNull])

    """

    (input_features, primitives) = _check_inputs(input_features, primitives)

    features = [x.copy() for x in input_features]

    feature_collection = FeatureCollection(features=features)
    feature_collection.reindex()

    for primitive in primitives:
        features_ = _features_from_primitive(
            primitive=primitive,
            feature_collection=feature_collection,
        )
        features.extend(features_)

    return features


================================================
FILE: featuretools/feature_discovery/type_defs.py
================================================
ANY = "ANY"


================================================
FILE: featuretools/feature_discovery/utils.py
================================================
import hashlib
import json
from functools import lru_cache
from typing import Any, Dict, Tuple

from woodwork.column_schema import ColumnSchema

from featuretools.feature_discovery.type_defs import ANY
from featuretools.primitives.base.primitive_base import PrimitiveBase
from featuretools.primitives.utils import (
    get_all_logical_type_names,
    get_all_primitives,
    serialize_primitive,
)

primitives_map = get_all_primitives()
logical_types_map = get_all_logical_type_names()


def column_schema_to_keys(column_schema: ColumnSchema) -> str:
    """
    Generate a hashing key from a ColumnSchema. For example:
    - ColumnSchema(logical_type=Double) -> "Double"
    - ColumnSchema(semantic_tags={"index"}) -> "index"
    - ColumnSchema(logical_type=Double, semantic_tags={"index", "other"}) -> "Double,index,other"

    Args:
        column_schema (ColumnSchema):
            Woodwork ColumnSchema to convert into a key

    Returns:
        str: hashing key
    """
    logical_type = column_schema.logical_type
    tags = column_schema.semantic_tags
    lt_key = None
    if logical_type:
        lt_key = type(logical_type).__name__

    tags = sorted(tags)
    if len(tags) > 0:
        tag_key = ",".join(tags)
        return f"{lt_key},{tag_key}" if lt_key is not None else tag_key

    elif lt_key is not None:
        return lt_key
    else:
        return ANY
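

# A dependency-free sketch of the key format produced above. It takes the
# logical type name and the semantic tags directly instead of a ColumnSchema;
# `schema_key` is a hypothetical helper written only for this illustration.

```python
from typing import Iterable, Optional

ANY = "ANY"


def schema_key(logical_type_name: Optional[str], semantic_tags: Iterable[str]) -> str:
    # Tags are sorted so the key does not depend on set iteration order.
    tags = sorted(semantic_tags)
    if tags:
        tag_key = ",".join(tags)
        if logical_type_name is not None:
            return f"{logical_type_name},{tag_key}"
        return tag_key
    if logical_type_name is not None:
        return logical_type_name
    # Neither a logical type nor tags: the schema matches anything.
    return ANY


keys = [
    schema_key("Double", set()),
    schema_key(None, {"index"}),
    schema_key("Double", {"other", "index"}),
    schema_key(None, set()),
]
```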


@lru_cache(maxsize=None)
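def hash_primitive(primitive: PrimitiveBase) -> Tuple[str, Dict[str, Any]]:
    """
    Hash a primitive by its serialized form.

    Args:
        primitive (PrimitiveBase)

    Returns:
        Tuple[str, Dict[str, Any]]: SHA-256 hex digest of the JSON-serialized
            primitive, and the serialized primitive dictionary
    """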
    hash_msg = hashlib.sha256()
    primitive_name = primitive.name
    assert isinstance(primitive_name, str)
    primitive_dict = serialize_primitive(primitive)
    primitive_json = json.dumps(primitive_dict).encode("utf-8")
    hash_msg.update(primitive_json)
    key = hash_msg.hexdigest()
    return (key, primitive_dict)
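

# The hashing step above can be sketched with the standard library alone.
# The dictionaries below are illustrative placeholders, not the confirmed
# output of `serialize_primitive`, and `hash_serialized` is a hypothetical
# helper.

```python
import hashlib
import json
from typing import Any, Dict


def hash_serialized(primitive_dict: Dict[str, Any]) -> str:
    # Serialize to JSON, encode to bytes, and take the SHA-256 hex digest.
    payload = json.dumps(primitive_dict).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest_a = hash_serialized({"type": "Scale", "arguments": {"factor": 2}})
digest_b = hash_serialized({"type": "Scale", "arguments": {"factor": 2}})
digest_c = hash_serialized({"type": "Scale", "arguments": {"factor": 3}})

# json.dumps is called without sort_keys, so key order affects the digest.
digest_xy = hash_serialized({"x": 1, "y": 2})
digest_yx = hash_serialized({"y": 2, "x": 1})
```

# Equal dictionaries built in the same key order give equal digests, so the
# digest is only a stable identifier if the serializer emits keys in a
# consistent order.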


def get_primitive_return_type(primitive: PrimitiveBase) -> ColumnSchema:
    """
    Get the return type of a primitive. If the primitive does not define a
    return type, its first input type is used instead.

    Args:
        primitive (PrimitiveBase)

    Returns:
        ColumnSchema
    """
    if primitive.return_type:
        return primitive.return_type
    return_type = primitive.input_types[0]
    if isinstance(return_type, list):
        return_type = return_type[0]
    return return_type


def flatten_list(nested_list):
    """Flatten a list of lists by one level."""
    return [item for sublist in nested_list for item in sublist]


================================================
FILE: featuretools/primitives/__init__.py
================================================
# flake8: noqa
import inspect
import logging
import traceback

import pkg_resources

from featuretools.primitives.standard import *
from featuretools.primitives.utils import (
    get_aggregation_primitives,
    get_default_aggregation_primitives,
    get_default_transform_primitives,
    get_transform_primitives,
    list_primitives,
    summarize_primitives,
)


def _load_primitives():
    """Load in a list of primitives registered by other libraries into Featuretools.

    Example entry_points definition for a library using this entry point either in:

        - setup.py:

            setup(
                entry_points={
                    'featuretools_primitives': [
                        'other_library = other_library',
                    ],
                },
            )

        - setup.cfg:

            [options.entry_points]
            featuretools_primitives =
                other_library = other_library

        - pyproject.toml:

            [project.entry-points."featuretools_primitives"]
            other_library = "other_library"

    where `other_library` is a top-level module containing all the primitives.
    """
    logger = logging.getLogger("featuretools")
    base_primitives = AggregationPrimitive, TransformPrimitive  # noqa: F405

    for entry_point in pkg_resources.iter_entry_points("featuretools_primitives"):
        try:
            loaded = entry_point.load()
        except Exception:
            message = f'Featuretools failed to load "{entry_point.name}" primitives from "{entry_point.module_name}". '
            message += "For a full stack trace, set logging to debug."
            logger.warning(message)
            logger.debug(traceback.format_exc())
            continue

        for key in dir(loaded):
            primitive = getattr(loaded, key, None)

            if (
                inspect.isclass(primitive)
                and issubclass(primitive, base_primitives)
                and primitive not in base_primitives
            ):
                name = primitive.__name__
                scope = globals()

                if name in scope:
                    this_module, that_module = (
                        primitive.__module__,
                        scope[name].__module__,
                    )
                    message = f'While loading primitives via "{entry_point.name}" entry point, '
                    message += (
                        f'ignored primitive "{name}" from "{this_module}" because '
                    )
                    message += (
                        f'a primitive with that name already exists in "{that_module}"'
                    )
                    logger.warning(message)
                else:
                    scope[name] = primitive


_load_primitives()


================================================
FILE: featuretools/primitives/base/__init__.py
================================================
from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive
from featuretools.primitives.base.primitive_base import PrimitiveBase
from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


================================================
FILE: featuretools/primitives/base/aggregation_primitive_base.py
================================================
from featuretools.primitives.base.primitive_base import PrimitiveBase


class AggregationPrimitive(PrimitiveBase):
    def generate_name(
        self,
        base_feature_names,
        relationship_path_name,
        parent_dataframe_name,
        where_str,
        use_prev_str,
    ):
        base_features_str = ", ".join(base_feature_names)
        return "%s(%s.%s%s%s%s)" % (
            self.name.upper(),
            relationship_path_name,
            base_features_str,
            where_str,
            use_prev_str,
            self.get_args_string(),
        )

    def generate_names(
        self,
        base_feature_names,
        relationship_path_name,
        parent_dataframe_name,
        where_str,
        use_prev_str,
    ):
        n = self.number_output_features
        base_name = self.generate_name(
            base_feature_names,
            relationship_path_name,
            parent_dataframe_name,
            where_str,
            use_prev_str,
        )
        return [base_name + "[%s]" % i for i in range(n)]
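

# A standalone sketch of the naming pattern used by `generate_name` and
# `generate_names` above. `aggregation_name` and `aggregation_names` are
# hypothetical helpers, and the feature and path names are made up for
# illustration.

```python
from typing import List


def aggregation_name(
    primitive_name: str,
    base_feature_names: List[str],
    relationship_path_name: str,
    where_str: str = "",
    use_prev_str: str = "",
    args_string: str = "",
) -> str:
    # Same format string as above: NAME(path.base_features<where><use_prev><args>)
    return "%s(%s.%s%s%s%s)" % (
        primitive_name.upper(),
        relationship_path_name,
        ", ".join(base_feature_names),
        where_str,
        use_prev_str,
        args_string,
    )


def aggregation_names(
    primitive_name: str,
    base_feature_names: List[str],
    relationship_path_name: str,
    n_outputs: int,
) -> List[str]:
    # Multi-output primitives append a zero-based slice index to the base name.
    base = aggregation_name(primitive_name, base_feature_names, relationship_path_name)
    return [base + "[%s]" % i for i in range(n_outputs)]


single = aggregation_name("mean", ["amount"], "transactions")
multi = aggregation_names("n_most_common", ["product_id"], "transactions", 2)
```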


================================================
FILE: featuretools/primitives/base/primitive_base.py
================================================
import os
from inspect import signature

import numpy as np
import pandas as pd

from featuretools import config
from featuretools.utils.description_utils import convert_to_nth


class PrimitiveBase(object):
    """Base class for all primitives."""

    #: (str): Name of the primitive
    name = None
    #: (list): woodwork.ColumnSchema types of inputs
    input_types = None
    #: (woodwork.ColumnSchema): ColumnSchema type of return
    return_type = None
    #: Default value this feature returns if no data found. Defaults to np.nan
    default_value = np.nan
    #: (bool): True if feature needs to know what the current calculation time
    # is (provided to computational backend as "time_last")
    uses_calc_time = False
    #: (int): Maximum number of features in the largest chain proceeding
    # downward from this feature's base features.
    max_stack_depth = None
    #: (int): Number of columns in feature matrix associated with this feature
    number_output_features = 1
    # whitelist of primitives that can have this primitive in input_types
    base_of = None
    # blacklist of primitives that cannot have this primitive in input_types
    base_of_exclude = None
    # whitelist of primitives that can be in input_types
    stack_on = None
    # blacklist of primitives that cannot be in input_types
    stack_on_exclude = None
    # determines if primitive can be in input_types for self
    stack_on_self = True
    # (bool) If True will only make one feature per unique set of base features
    commutative = False
    #: (str, list[str]): description template of the primitive. Input column
    # descriptions are passed as positional arguments to the template. Slice
    # number (if present) in "nth" form is passed to the template via the
    # `nth_slice` keyword argument. Multi-output primitives can use a list to
    # differentiate between the base description and a slice description.
    description_template = None

    def __init__(self):
        pass

    def __call__(self, *args, **kwargs):
        series_args = [pd.Series(arg) for arg in args]
        try:
            return self._method(*series_args, **kwargs)
        except AttributeError:
            self._method = self.get_function()
            return self._method(*series_args, **kwargs)

    def __lt__(self, other):
        return (self.name + self.get_args_string()) < (
            other.name + other.get_args_string()
        )

    def generate_name(self):
        raise NotImplementedError("Subclass must implement")

    def generate_names(self):
        raise NotImplementedError("Subclass must implement")

    def get_function(self):
        raise NotImplementedError("Subclass must implement")

    def get_filepath(self, filename):
        return os.path.join(config.get("primitive_data_folder"), filename)

    def get_args_string(self):
        strings = []
        for name, value in self.get_arguments():
            # format arg to string
            string = "{}={}".format(name, str(value))
            strings.append(string)

        if len(strings) == 0:
            return ""

        string = ", ".join(strings)
        string = ", " + string
        return string

    def get_arguments(self):
        values = []

        args = signature(self.__class__).parameters.items()
        for name, arg in args:
            # assert that arg is attribute of primitive
            error = '"{}" must be attribute of {}'
            assert hasattr(self, name), error.format(name, self.__class__.__name__)

            value = getattr(self, name)
            # check if args are the same type
            if isinstance(value, type(arg.default)):
                # skip if default value
                if arg.default == value:
                    continue

            values.append((name, value))

        return values

    def get_description(
        self,
        input_column_descriptions,
        slice_num=None,
        template_override=None,
    ):
        template = template_override or self.description_template
        if template:
            if isinstance(template, list):
                if slice_num is not None:
                    slice_index = slice_num + 1
                    if slice_index < len(template):
                        return template[slice_index].format(
                            *input_column_descriptions,
                            nth_slice=convert_to_nth(slice_index),
                        )
                    else:
                        if len(template) > 2:
                            raise IndexError("Slice out of range of template")
                        return template[1].format(
                            *input_column_descriptions,
                            nth_slice=convert_to_nth(slice_index),
                        )
                else:
                    template = template[0]
            return template.format(*input_column_descriptions)

        # generic case:
        name = self.name.upper() if self.name is not None else type(self).__name__
        if slice_num is not None:
            nth_slice = convert_to_nth(slice_num + 1)
            description = "the {} output from applying {} to {}".format(
                nth_slice,
                name,
                ", ".join(input_column_descriptions),
            )
        else:
            description = "the result of applying {} to {}".format(
                name,
                ", ".join(input_column_descriptions),
            )
        return description

    @staticmethod
    def flatten_nested_input_types(input_types):
        """Flattens nested column schema inputs into a single list."""
        if isinstance(input_types[0], list):
            input_types = [
                sub_input for input_obj in input_types for sub_input in input_obj
            ]
        return input_types
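

# `get_arguments` and `get_args_string` rely on each constructor parameter
# being stored as an attribute of the same name. A minimal sketch of that
# convention follows; `Scale`, `non_default_arguments`, and `args_string` are
# hypothetical names, not featuretools APIs.

```python
from inspect import signature
from typing import Any, List, Tuple


class Scale:
    """Hypothetical primitive-like class that stores its arguments as attributes."""

    def __init__(self, factor=1.0, offset=0.0):
        self.factor = factor
        self.offset = offset


def non_default_arguments(obj: Any) -> List[Tuple[str, Any]]:
    values = []
    for name, param in signature(obj.__class__).parameters.items():
        value = getattr(obj, name)
        # Skip only when the value has the default's type and equals the default.
        if isinstance(value, type(param.default)) and param.default == value:
            continue
        values.append((name, value))
    return values


def args_string(obj: Any) -> str:
    strings = ["{}={}".format(name, value) for name, value in non_default_arguments(obj)]
    # A leading ", " lets the result be appended directly after the base feature names.
    return ", " + ", ".join(strings) if strings else ""


default_args = non_default_arguments(Scale())
custom_string = args_string(Scale(factor=2.0, offset=3.0))
int_args = non_default_arguments(Scale(factor=1))
```

# Because of the type check, an integer `1` passed for a float default of
# `1.0` is still reported as a non-default argument in this sketch.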


================================================
FILE: featuretools/primitives/base/transform_primitive_base.py
================================================
from featuretools.primitives.base.primitive_base import PrimitiveBase


class TransformPrimitive(PrimitiveBase):
    """Feature for a dataframe that is based on one or more other features
    in that dataframe."""

    # (bool) If True, feature function depends on all values of dataframe
    #   (and will receive these values as input, regardless of specified instance ids)
    uses_full_dataframe = False

    def generate_name(self, base_feature_names):
        return "%s(%s%s)" % (
            self.name.upper(),
            ", ".join(base_feature_names),
            self.get_args_string(),
        )

    def generate_names(self, base_feature_names):
        n = self.number_output_features
        base_name = self.generate_name(base_feature_names)
        return [base_name + "[%s]" % i for i in range(n)]
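

# A standalone sketch of the transform naming pattern above.
# `transform_name` and `transform_names` are hypothetical helpers, and the
# primitive and feature names are made up for illustration.

```python
from typing import List


def transform_name(
    primitive_name: str,
    base_feature_names: List[str],
    args_string: str = "",
) -> str:
    # Same format string as above: NAME(base_features<args>)
    return "%s(%s%s)" % (
        primitive_name.upper(),
        ", ".join(base_feature_names),
        args_string,
    )


def transform_names(
    primitive_name: str,
    base_feature_names: List[str],
    n_outputs: int,
) -> List[str]:
    # Multi-output primitives append a zero-based slice index to the base name.
    base = transform_name(primitive_name, base_feature_names)
    return [base + "[%s]" % i for i in range(n_outputs)]


plain = transform_name("absolute", ["f2"])
with_args = transform_name("scale", ["f2"], ", factor=2.0")
sliced = transform_names("split", ["f1", "f2"], 2)
```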


================================================
FILE: featuretools/primitives/options_utils.py
================================================
import logging
import warnings
from itertools import permutations

from featuretools import primitives
from featuretools.feature_base import IdentityFeature

logger = logging.getLogger("featuretools")


def _get_primitive_options():
    # all possible option keys: function that verifies value type
    return {
        "ignore_dataframes": list_dataframe_check,
        "include_dataframes": list_dataframe_check,
        "ignore_columns": dict_to_list_column_check,
        "include_columns": dict_to_list_column_check,
        "ignore_groupby_dataframes": list_dataframe_check,
        "include_groupby_dataframes": list_dataframe_check,
        "ignore_groupby_columns": dict_to_list_column_check,
        "include_groupby_columns": dict_to_list_column_check,
    }


def dict_to_list_column_check(option, es):
    if not (
        isinstance(option, dict)
        and all([isinstance(option_val, list) for option_val in option.values()])
    ):
        return False
    else:
        for dataframe, columns in option.items():
            if dataframe not in es:
                warnings.warn("Dataframe '%s' not in entityset" % (dataframe))
            else:
                for invalid_col in [
                    column for column in columns if column not in es[dataframe]
                ]:
                    warnings.warn(
                        "Column '%s' not in dataframe '%s'" % (invalid_col, dataframe),
                    )
        return True


def list_dataframe_check(option, es):
    if not isinstance(option, list):
        return False
    else:
        for invalid_dataframe in [
            dataframe for dataframe in option if dataframe not in es
        ]:
            warnings.warn("Dataframe '%s' not in entityset" % (invalid_dataframe))
        return True


def generate_all_primitive_options(
    all_primitives,
    primitive_options,
    ignore_dataframes,
    ignore_columns,
    es,
):
    dataframe_dict = {
        dataframe.ww.name: [col for col in dataframe.columns]
        for dataframe in es.dataframes
    }

    primitive_options = _init_primitive_options(primitive_options, dataframe_dict)
    global_ignore_dataframes = ignore_dataframes
    global_ignore_columns = ignore_columns.copy()
    # for now, only use primitive names as option keys
    for primitive in all_primitives:
        if primitive in primitive_options and primitive.name in primitive_options:
            msg = (
                "Options present for primitive instance and generic "
                "primitive class (%s), primitive instance will not use generic "
                "options" % (primitive.name)
            )
            warnings.warn(msg)
        if primitive in primitive_options or primitive.name in primitive_options:
            options = primitive_options.get(
                primitive,
                primitive_options.get(primitive.name),
            )
            # Reconcile global options with individually-specified options
            included_dataframes = set().union(
                *[
                    option.get("include_dataframes", set()).union(
                        option.get("include_columns", {}).keys(),
                    )
                    for option in options
                ]
            )
            global_ignore_dataframes = global_ignore_dataframes.difference(
                included_dataframes,
            )
            for option in options:
                # don't globally ignore a column if it's included for a primitive
                if "include_columns" in option:
                    for dataframe, include_cols in option["include_columns"].items():
                        global_ignore_columns[dataframe] = global_ignore_columns[
                            dataframe
                        ].difference(include_cols)
                option["ignore_dataframes"] = option["ignore_dataframes"].union(
                    ignore_dataframes.difference(included_dataframes),
                )
            for dataframe, ignore_cols in ignore_columns.items():
                # if already ignoring columns for this dataframe, add globals
                for option in options:
                    if dataframe in option["ignore_columns"]:
                        option["ignore_columns"][dataframe] = option["ignore_columns"][
                            dataframe
                        ].union(ignore_cols)
                    # if no ignore_columns and dataframe is explicitly included, don't ignore the column
                    elif dataframe in included_dataframes:
                        continue
                    # Otherwise, keep the global option
                    else:
                        option["ignore_columns"][dataframe] = ignore_cols
        else:
            # no user specified options, just use global defaults
            primitive_options[primitive] = [
                {
                    "ignore_dataframes": ignore_dataframes,
                    "ignore_columns": ignore_columns,
                },
            ]
    return primitive_options, global_ignore_dataframes, global_ignore_columns


def _init_primitive_options(primitive_options, es):
    # Flatten all tuple keys, convert value lists into sets, check for
    # conflicting keys
    flattened_options = {}
    for primitive_keys, options in primitive_options.items():
        if not isinstance(primitive_keys, tuple):
            primitive_keys = (primitive_keys,)
        if isinstance(options, list):
            for primitive_key in primitive_keys:
                if isinstance(primitive_key, str):
                    primitive = primitives.get_aggregation_primitives().get(
                        primitive_key,
                    ) or primitives.get_transform_primitives().get(primitive_key)
                    if not primitive:
                        msg = "Unknown primitive with name '{}'".format(primitive_key)
                        raise ValueError(msg)
                else:
                    primitive = primitive_key
                assert (
                    len(primitive.input_types[0]) == len(options)
                    if isinstance(primitive.input_types[0], list)
                    else len(primitive.input_types) == len(options)
                ), (
                    "Number of options does not match number of inputs for primitive %s"
                    % (primitive_key)
                )
            options = [
                _init_option_dict(primitive_keys, option, es) for option in options
            ]
        else:
            options = [_init_option_dict(primitive_keys, options, es)]

        for primitive in primitive_keys:
            if isinstance(primitive, type):
                primitive = primitive.name

            # if primitive is specified more than once, raise error
            if primitive in flattened_options:
                raise KeyError("Multiple options found for primitive %s" % (primitive))

            flattened_options[primitive] = options
    return flattened_options


def _init_option_dict(key, option_dict, es):
    initialized_option_dict = {}
    primitive_options = _get_primitive_options()
    # verify all keys are valid and match expected type, convert lists to sets
    for option_key, option in option_dict.items():
        if option_key not in primitive_options:
            raise KeyError(
                "Unrecognized primitive option '%s' for %s"
                % (option_key, ",".join(key)),
            )
        if not primitive_options[option_key](option, es):
            raise TypeError(
                "Incorrect type formatting for '%s' for %s"
                % (option_key, ",".join(key)),
            )
        if isinstance(option, list):
            initialized_option_dict[option_key] = set(option)
        elif isinstance(option, dict):
            initialized_option_dict[option_key] = {
                key: set(option[key]) for key in option
            }
    # initialize ignore_dataframes and ignore_columns to empty sets if not present
    if "ignore_columns" not in initialized_option_dict:
        initialized_option_dict["ignore_columns"] = dict()
    if "ignore_dataframes" not in initialized_option_dict:
        initialized_option_dict["ignore_dataframes"] = set()
    return initialized_option_dict
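

# A reduced sketch of the normalization done by `_init_option_dict`: list
# values become sets, dict values become dicts of sets, and the two ignore
# options always exist afterwards. Key validation and type checks are left
# out, and `init_option_dict_sketch` is a hypothetical name.

```python
from typing import Any, Dict


def init_option_dict_sketch(option_dict: Dict[str, Any]) -> Dict[str, Any]:
    initialized: Dict[str, Any] = {}
    for option_key, option in option_dict.items():
        if isinstance(option, list):
            # e.g. include_dataframes: ["customers"] -> {"customers"}
            initialized[option_key] = set(option)
        elif isinstance(option, dict):
            # e.g. ignore_columns: {"customers": ["zip"]} -> {"customers": {"zip"}}
            initialized[option_key] = {name: set(cols) for name, cols in option.items()}
    # Downstream code can rely on these two keys being present.
    initialized.setdefault("ignore_columns", {})
    initialized.setdefault("ignore_dataframes", set())
    return initialized


normalized_options = init_option_dict_sketch(
    {
        "include_dataframes": ["customers"],
        "ignore_columns": {"customers": ["zip", "zip"]},
    },
)
```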


def column_filter(f, options, groupby=False):
    if groupby and not f.column_schema.semantic_tags.intersection(
        {"category", "foreign_key"},
    ):
        return False
    include_cols = "include_groupby_columns" if groupby else "include_columns"
    ignore_cols = "ignore_groupby_columns" if groupby else "ignore_columns"
    include_dataframes = (
        "include_groupby_dataframes" if groupby else "include_dataframes"
    )
    ignore_dataframes = "ignore_groupby_dataframes" if groupby else "ignore_dataframes"

    dependencies = f.get_dependencies(deep=True) + [f]
    for base_f in dependencies:
        if isinstance(base_f, IdentityFeature):
            if (
                include_cols in options
                and base_f.dataframe_name in options[include_cols]
            ):
                if base_f.get_name() in options[include_cols][base_f.dataframe_name]:
                    continue  # this is a valid feature, go to next
                else:
                    return False  # this is not an included feature
            if ignore_cols in options and base_f.dataframe_name in options[ignore_cols]:
                if base_f.get_name() in options[ignore_cols][base_f.dataframe_name]:
                    return False  # ignore this feature
        if include_dataframes in options:
            return base_f.dataframe_name in options[include_dataframes]
        elif (
            ignore_dataframes in options
            and base_f.dataframe_name in options[ignore_dataframes]
        ):
            return False  # ignore the dataframe
    return True


def ignore_dataframe_for_primitive(options, dataframe, groupby=False):
    # This logic handles whether given options ignore a dataframe or not
    def should_ignore_dataframe(option):
        if groupby:
            if (
                "include_groupby_columns" not in option
                or dataframe.ww.name not in option["include_groupby_columns"]
            ):
                if (
                    "include_groupby_dataframes" in option
                    and dataframe.ww.name not in option["include_groupby_dataframes"]
                ):
                    return True
                elif (
                    "ignore_groupby_dataframes" in option
                    and dataframe.ww.name in option["ignore_groupby_dataframes"]
                ):
                    return True
        if (
            "include_columns" in option
            and dataframe.ww.name in option["include_columns"]
        ):
            return False
        elif "include_dataframes" in option:
            return dataframe.ww.name not in option["include_dataframes"]
        elif dataframe.ww.name in option["ignore_dataframes"]:
            return True
        else:
            return False

    return any([should_ignore_dataframe(option) for option in options])


def filter_groupby_matches_by_options(groupby_matches, options):
    return filter_matches_by_options(
        [(groupby_match,) for groupby_match in groupby_matches],
        options,
        groupby=True,
    )


def filter_matches_by_options(matches, options, groupby=False, commutative=False):
    # If more than one option, then each option applies to the input at the same position
    if len(options) > 1:

        def is_valid_match(match):
            return all(
                column_filter(m, option, groupby) for m, option in zip(match, options)
            )

    else:

        def is_valid_match(match):
            return all(column_filter(f, options[0], groupby) for f in match)

    valid_matches = set()
    for match in matches:
        if is_valid_match(match):
            valid_matches.add(match)
        elif commutative:
            for order in permutations(match):
                if is_valid_match(order):
                    valid_matches.add(order)
                    break

    return sorted(
        valid_matches,
        key=lambda features: ([feature.unique_name() for feature in features]),
    )
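

# The commutative fallback in `filter_matches_by_options` can be sketched on
# plain tuples of strings: when a match is rejected in its given order, other
# orderings are tried and the first valid one is kept. `filter_matches` and
# `first_input_is_a` are hypothetical names, and the validity rule is made up
# for illustration.

```python
from itertools import permutations
from typing import Callable, Iterable, List, Tuple

Match = Tuple[str, ...]


def filter_matches(
    matches: Iterable[Match],
    is_valid: Callable[[Match], bool],
    commutative: bool = False,
) -> List[Match]:
    valid = set()
    for match in matches:
        if is_valid(match):
            valid.add(match)
        elif commutative:
            # Input order does not matter, so any valid ordering is acceptable.
            for order in permutations(match):
                if is_valid(order):
                    valid.add(order)
                    break
    return sorted(valid)


def first_input_is_a(match: Match) -> bool:
    # Stand-in for per-input options: only "a" is allowed in the first position.
    return match[0] == "a"


candidates = [("b", "a"), ("b", "c")]
strict = filter_matches(candidates, first_input_is_a)
relaxed = filter_matches(candidates, first_input_is_a, commutative=True)
```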


================================================
FILE: featuretools/primitives/standard/__init__.py
================================================
# flake8: noqa
from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive
from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.aggregation import *
from featuretools.primitives.standard.transform import *


================================================
FILE: featuretools/primitives/standard/aggregation/__init__.py
================================================
from featuretools.primitives.standard.aggregation.all_primitive import All
from featuretools.primitives.standard.aggregation.any_primitive import Any
from featuretools.primitives.standard.aggregation.avg_time_between import AvgTimeBetween
from featuretools.primitives.standard.aggregation.average_count_per_unique import (
    AverageCountPerUnique,
)
from featuretools.primitives.standard.aggregation.count import Count
from featuretools.primitives.standard.aggregation.count_above_mean import CountAboveMean
from featuretools.primitives.standard.aggregation.count_below_mean import CountBelowMean
from featuretools.primitives.standard.aggregation.count_greater_than import (
    CountGreaterThan,
)
from featuretools.primitives.standard.aggregation.count_inside_nth_std import (
    CountInsideNthSTD,
)
from featuretools.primitives.standard.aggregation.count_inside_range import (
    CountInsideRange,
)
from featuretools.primitives.standard.aggregation.count_less_than import CountLessThan
from featuretools.primitives.standard.aggregation.count_outside_nth_std import (
    CountOutsideNthSTD,
)
from featuretools.primitives.standard.aggregation.count_outside_range import (
    CountOutsideRange,
)
from featuretools.primitives.standard.aggregation.date_first_event import DateFirstEvent
from featuretools.primitives.standard.aggregation.entropy import Entropy
from featuretools.primitives.standard.aggregation.first import First
from featuretools.primitives.standard.aggregation.first_last_time_delta import (
    FirstLastTimeDelta,
)
from featuretools.primitives.standard.aggregation.kurtosis import Kurtosis
from featuretools.primitives.standard.aggregation.is_unique import IsUnique
from featuretools.primitives.standard.aggregation.last import Last
from featuretools.primitives.standard.aggregation.max_primitive import Max
from featuretools.primitives.standard.aggregation.max_consecutive_false import (
    MaxConsecutiveFalse,
)
from featuretools.primitives.standard.aggregation.max_consecutive_negatives import (
    MaxConsecutiveNegatives,
)
from featuretools.primitives.standard.aggregation.max_consecutive_positives import (
    MaxConsecutivePositives,
)
from featuretools.primitives.standard.aggregation.max_consecutive_true import (
    MaxConsecutiveTrue,
)
from featuretools.primitives.standard.aggregation.max_consecutive_zeros import (
    MaxConsecutiveZeros,
)
from featuretools.primitives.standard.aggregation.mean import Mean
from featuretools.primitives.standard.aggregation.median import Median
from featuretools.primitives.standard.aggregation.max_count import MaxCount
from featuretools.primitives.standard.aggregation.median_count import MedianCount
from featuretools.primitives.standard.aggregation.max_min_delta import MaxMinDelta
from featuretools.primitives.standard.aggregation.min_count import MinCount
from featuretools.primitives.standard.aggregation.min_primitive import Min
from featuretools.primitives.standard.aggregation.mode import Mode
from featuretools.primitives.standard.aggregation.n_unique_days import NUniqueDays
from featuretools.primitives.standard.aggregation.n_unique_days_of_calendar_year import (
    NUniqueDaysOfCalendarYear,
)
from featuretools.primitives.standard.aggregation.n_unique_days_of_month import (
    NUniqueDaysOfMonth,
)
from featuretools.primitives.standard.aggregation.has_no_duplicates import (
    HasNoDuplicates,
)
from featuretools.primitives.standard.aggregation.is_monotonically_decreasing import (
    IsMonotonicallyDecreasing,
)
from featuretools.primitives.standard.aggregation.is_monotonically_increasing import (
    IsMonotonicallyIncreasing,
)
from featuretools.primitives.standard.aggregation.n_unique_months import NUniqueMonths
from featuretools.primitives.standard.aggregation.n_unique_weeks import NUniqueWeeks
from featuretools.primitives.standard.aggregation.n_most_common import NMostCommon
from featuretools.primitives.standard.aggregation.n_most_common_frequency import (
    NMostCommonFrequency,
)
from featuretools.primitives.standard.aggregation.num_true import NumTrue
from featuretools.primitives.standard.aggregation.num_peaks import NumPeaks
from featuretools.primitives.standard.aggregation.num_zero_crossings import (
    NumZeroCrossings,
)
from featuretools.primitives.standard.aggregation.num_true_since_last_false import (
    NumTrueSinceLastFalse,
)
from featuretools.primitives.standard.aggregation.num_false_since_last_true import (
    NumFalseSinceLastTrue,
)
from featuretools.primitives.standard.aggregation.num_consecutive_greater_mean import (
    NumConsecutiveGreaterMean,
)
from featuretools.primitives.standard.aggregation.num_consecutive_less_mean import (
    NumConsecutiveLessMean,
)
from featuretools.primitives.standard.aggregation.num_unique import NumUnique
from featuretools.primitives.standard.aggregation.percent_unique import PercentUnique
from featuretools.primitives.standard.aggregation.percent_true import PercentTrue
from featuretools.primitives.standard.aggregation.skew import Skew
from featuretools.primitives.standard.aggregation.std import Std
from featuretools.primitives.standard.aggregation.sum_primitive import Sum
from featuretools.primitives.standard.aggregation.time_since_first import TimeSinceFirst
from featuretools.primitives.standard.aggregation.time_since_last import TimeSinceLast
from featuretools.primitives.standard.aggregation.time_since_last_true import (
    TimeSinceLastTrue,
)
from featuretools.primitives.standard.aggregation.time_since_last_min import (
    TimeSinceLastMin,
)
from featuretools.primitives.standard.aggregation.time_since_last_max import (
    TimeSinceLastMax,
)
from featuretools.primitives.standard.aggregation.time_since_last_false import (
    TimeSinceLastFalse,
)
from featuretools.primitives.standard.aggregation.trend import Trend
from featuretools.primitives.standard.aggregation.variance import Variance


================================================
FILE: featuretools/primitives/standard/aggregation/all_primitive.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class All(AggregationPrimitive):
    """Calculates if all values are 'True' in a list.

    Description:
        Given a list of booleans, return `True` if all
        of the values are `True`.

    Examples:
        >>> all = All()
        >>> all([False, False, False, True])
        False
    """

    name = "all"
    input_types = [
        [ColumnSchema(logical_type=Boolean)],
        [ColumnSchema(logical_type=BooleanNullable)],
    ]
    return_type = ColumnSchema(logical_type=Boolean)
    stack_on_self = False
    description_template = "whether all of {} are true"

    def get_function(self):
        return np.all


================================================
FILE: featuretools/primitives/standard/aggregation/any_primitive.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Any(AggregationPrimitive):
    """Determines if any value is 'True' in a list.

    Description:
        Given a list of booleans, return `True` if one or
        more of the values are `True`.

    Examples:
        >>> any = Any()
        >>> any([False, False, False, True])
        True
    """

    name = "any"
    input_types = [
        [ColumnSchema(logical_type=Boolean)],
        [ColumnSchema(logical_type=BooleanNullable)],
    ]
    return_type = ColumnSchema(logical_type=Boolean)
    stack_on_self = False
    description_template = "whether any of {} are true"

    def get_function(self):
        return np.any


================================================
FILE: featuretools/primitives/standard/aggregation/average_count_per_unique.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import AggregationPrimitive


class AverageCountPerUnique(AggregationPrimitive):
    """Determines the average count across all unique values.

    Args:
        skipna (bool): Determines whether to skip NA/null values.
            Defaults to True, which skips NA/null values.

    Examples:
        Determine the average count values for all unique items
        in the input
        >>> input = [1, 1, 2, 2, 3, 4, 5, 6, 7, 8]
        >>> avg_count_per_unique = AverageCountPerUnique()
        >>> avg_count_per_unique(input)
        1.25

        Determine the average count values for all unique items
        in the input with nan values ignored
        >>> input = [1, 1, 2, 2, 3, 4, 5, None, 6, 7, 8]
        >>> avg_count_per_unique = AverageCountPerUnique()
        >>> avg_count_per_unique(input)
        1.25

        Determine the average count values for all unique items
        in the input with nan values included
        >>> input = [1, 2, 2, 3, 4, 5, None, 6, 7, 8, 9]
        >>> avg_count_per_unique_skipna_false = AverageCountPerUnique(skipna=False)
        >>> avg_count_per_unique_skipna_false(input)
        1.1
    """

    name = "average_count_per_unique"
    input_types = [ColumnSchema(semantic_tags={"category"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    default_value = 0

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def average_count_per_unique(x):
            return x.value_counts(
                dropna=self.skipna,
            ).mean(skipna=self.skipna)

        return average_count_per_unique


================================================
FILE: featuretools/primitives/standard/aggregation/avg_time_between.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive
from featuretools.utils import convert_time_units


class AvgTimeBetween(AggregationPrimitive):
    """Computes the average number of seconds between consecutive events.

    Description:
        Given a list of datetimes, return the average time (default in seconds)
        elapsed between consecutive events. If there are fewer
        than 2 non-null values, return `NaN`.

    Args:
        unit (str): Defines the unit of time.
            Defaults to seconds. Acceptable values:
            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds

    Examples:
        >>> from datetime import datetime
        >>> avg_time_between = AvgTimeBetween()
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30)]
        >>> avg_time_between(times)
        375.0
        >>> avg_time_between = AvgTimeBetween(unit="minutes")
        >>> avg_time_between(times)
        6.25
    """

    name = "avg_time_between"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    description_template = "the average time between each of {}"

    def __init__(self, unit="seconds"):
        self.unit = unit.lower()

    def get_function(self):
        def pd_avg_time_between(x):
            """Assumes time scales are closer to the order of
            seconds than to nanoseconds. If times are much closer
            to nanoseconds, floating point errors are possible.

            This could be fixed with another function that
            calculates the mean before converting to seconds.
            """
            x = x.dropna()
            if x.shape[0] < 2:
                return np.nan
            if isinstance(x.iloc[0], (pd.Timestamp, datetime)):
                x = x.view("int64")

            # use len(x)-1 because we care about difference
            # between values, len(x)-1 = len(diff(x))
            avg = (x.max() - x.min()) / (len(x) - 1)
            avg = avg * 1e-9

            # long form:
            # diff_in_ns = x.diff().iloc[1:].astype('int64')
            # diff_in_seconds = diff_in_ns * 1e-9
            # avg = diff_in_seconds.mean()
            return convert_time_units(avg, self.unit)

        return pd_avg_time_between
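

# The commented "long form" in pd_avg_time_between and the shortcut the
# function actually uses agree for time-ordered input, because consecutive
# differences telescope to (last - first). The sketch below illustrates that
# identity with the standard library only. `avg_seconds_between` and
# `mean_of_consecutive_diffs` are hypothetical helpers written for this
# illustration; they are not part of featuretools.

```python
import math
from datetime import datetime


def avg_seconds_between(times):
    """Shortcut form: total span divided by the number of gaps."""
    times = [t for t in times if t is not None]
    if len(times) < 2:
        return math.nan
    span = max(times) - min(times)
    return span.total_seconds() / (len(times) - 1)


def mean_of_consecutive_diffs(times):
    """Long form: mean of the gaps between time-ordered neighbours."""
    ordered = sorted(t for t in times if t is not None)
    if len(ordered) < 2:
        return math.nan
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


times = [
    datetime(2010, 1, 1, 11, 45, 0),
    datetime(2010, 1, 1, 11, 55, 15),
    datetime(2010, 1, 1, 11, 57, 30),
]
short_form = avg_seconds_between(times)
long_form = mean_of_consecutive_diffs(times)
```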


================================================
FILE: featuretools/primitives/standard/aggregation/count.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Count(AggregationPrimitive):
    """Determines the total number of values, excluding `NaN`.

    Examples:
        >>> count = Count()
        >>> count([1, 2, 3, 4, 5, None])
        5
    """

    name = "count"
    input_types = [ColumnSchema(semantic_tags={"index"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0
    description_template = "the number"

    def get_function(self):
        return pd.Series.count

    def generate_name(
        self,
        base_feature_names,
        relationship_path_name,
        parent_dataframe_name,
        where_str,
        use_prev_str,
    ):
        return "COUNT(%s%s%s)" % (relationship_path_name, where_str, use_prev_str)


================================================
FILE: featuretools/primitives/standard/aggregation/count_above_mean.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class CountAboveMean(AggregationPrimitive):
    """Calculates the number of values that are above the mean.

    Args:
        skipna (bool): Determines whether to skip NA/null values. Defaults to
            True. If False and the input contains NA/null, the result is NaN.

    Examples:
        >>> count_above_mean = CountAboveMean()
        >>> count_above_mean([1, 2, 3, 4, 5])
        2

        The way NaNs are treated can be controlled.

        >>> count_above_mean_skipna = CountAboveMean(skipna=False)
        >>> count_above_mean_skipna([1, 2, 3, 4, 5, None])
        nan
    """

    name = "count_above_mean"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def count_above_mean(x):
            mean = x.mean(skipna=self.skipna)
            if np.isnan(mean):
                return np.nan
            return len(x[x > mean])

        return count_above_mean


================================================
FILE: featuretools/primitives/standard/aggregation/count_below_mean.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class CountBelowMean(AggregationPrimitive):
    """Determines the number of values that are below the mean.

    Args:
        skipna (bool): Determines whether to skip NA/null values. Defaults to
            True. If False and the input contains NA/null, the result is NaN.

    Examples:
        >>> count_below_mean = CountBelowMean()
        >>> count_below_mean([1, 2, 3, 4, 10])
        3

        The way NaNs are treated can be controlled.

        >>> count_below_mean_skipna = CountBelowMean(skipna=False)
        >>> count_below_mean_skipna([1, 2, 3, 4, 5, None])
        nan
    """

    name = "count_below_mean"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def count_below_mean(x):
            mean = x.mean(skipna=self.skipna)
            if np.isnan(mean):
                return np.nan
            return len(x[x < mean])

        return count_below_mean


================================================
FILE: featuretools/primitives/standard/aggregation/count_greater_than.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Integer

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class CountGreaterThan(AggregationPrimitive):
    """Determines the number of values greater than a controllable threshold.

    Args:
        threshold (float): The threshold to use when counting the number
            of values greater than. Defaults to 10.

    Examples:
        >>> count_greater_than = CountGreaterThan(threshold=3)
        >>> count_greater_than([1, 2, 3, 4, 5])
        2
    """

    name = "count_greater_than"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, threshold=10):
        self.threshold = threshold

    def get_function(self):
        def count_greater_than(x):
            return x[x > self.threshold].count()

        return count_greater_than


================================================
FILE: featuretools/primitives/standard/aggregation/count_inside_nth_std.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Integer

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class CountInsideNthSTD(AggregationPrimitive):
    """Determines the count of observations that lie inside
    the first N standard deviations (inclusive).

    Args:
        n (float): Number of standard deviations. Default is 1.

    Examples:
        >>> count_inside_nth_std = CountInsideNthSTD(n=1.5)
        >>> count_inside_nth_std([1, 10, 15, 20, 100])
        4
    """

    name = "count_inside_nth_std"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, n=1):
        if n < 0:
            raise ValueError("n must be a positive number")

        self.n = n

    def get_function(self):
        def count_inside_nth_std(x):
            cond = np.abs(x - np.mean(x)) <= np.std(x) * self.n
            return cond.sum()

        return count_inside_nth_std
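

# A rough standard-library sketch of the "inside N standard deviations" test
# used above, paired with its complement (the condition used by
# CountOutsideNthSTD). It assumes the population standard deviation (ddof=0),
# which is NumPy's default for np.std. `count_inside_and_outside` is a
# hypothetical helper for illustration only, not part of featuretools.

```python
from statistics import fmean, pstdev


def count_inside_and_outside(values, n=1):
    """Return (inside, outside) counts around the mean of `values`."""
    center = fmean(values)
    limit = pstdev(values) * n  # population std; an assumption of this sketch
    inside = sum(1 for v in values if abs(v - center) <= limit)
    outside = sum(1 for v in values if abs(v - center) > limit)
    return inside, outside


inside, outside = count_inside_and_outside([1, 10, 15, 20, 100], n=1.5)
```

# The two conditions are exact complements, so the two counts always sum to
# the number of non-missing inputs.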


================================================
FILE: featuretools/primitives/standard/aggregation/count_inside_range.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class CountInsideRange(AggregationPrimitive):
    """Determines the number of values that fall within a certain range.

    Args:
        lower (float): Lower boundary of range (inclusive). Default is 0.
        upper (float): Upper boundary of range (inclusive). Default is 1.
        skipna (bool): If False and any value in x is NaN, the
            result will be NaN. If True, NaN values are skipped.
            Default is True.

    Examples:
        >>> count_inside_range = CountInsideRange(lower=1.5, upper=3.6)
        >>> count_inside_range([1, 2, 3, 4, 5])
        2

        The way NaNs are treated can be controlled.

        >>> count_inside_range_skipna = CountInsideRange(skipna=False)
        >>> count_inside_range_skipna([1, 2, 3, 4, 5, None])
        nan
    """

    name = "count_inside_range"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, lower=0, upper=1, skipna=True):
        self.lower = lower
        self.upper = upper
        self.skipna = skipna

    def get_function(self):
        def count_inside_range(x):
            if not self.skipna and x.isnull().values.any():
                return np.nan
            cond = (self.lower <= x) & (x <= self.upper)
            return cond.sum()

        return count_inside_range
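

# A plain-Python sketch of the inclusive range check and the skipna rule
# described in the docstring above. `count_inside_range_sketch` and
# `_is_missing` are hypothetical names used only for this illustration; the
# primitive itself operates on a pandas Series.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_inside_range_sketch(values, lower=0, upper=1, skipna=True):
    """Count values with lower <= value <= upper (both bounds inclusive)."""
    if not skipna and any(_is_missing(v) for v in values):
        return math.nan
    return sum(
        1 for v in values if not _is_missing(v) and lower <= v <= upper
    )


default_bounds = count_inside_range_sketch([1, 2, 3, 4, 5, None])
custom_bounds = count_inside_range_sketch([1, 2, 3, 4, 5], lower=1.5, upper=3.6)
strict_nan = count_inside_range_sketch([1, 2, 3, 4, 5, None], skipna=False)
```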


================================================
FILE: featuretools/primitives/standard/aggregation/count_less_than.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Integer

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class CountLessThan(AggregationPrimitive):
    """Determines the number of values less than a controllable threshold.

    Args:
        threshold (float): The threshold to use when counting the number
            of values less than. Defaults to 10.

    Examples:
        >>> count_less_than = CountLessThan(threshold=3.5)
        >>> count_less_than([1, 2, 3, 4, 5])
        3
    """

    name = "count_less_than"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, threshold=10):
        self.threshold = threshold

    def get_function(self):
        def count_less_than(x):
            return x[x < self.threshold].count()

        return count_less_than


================================================
FILE: featuretools/primitives/standard/aggregation/count_outside_nth_std.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Integer

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class CountOutsideNthSTD(AggregationPrimitive):
    """Determines the number of observations that lie outside
    the first N standard deviations.

    Args:
        n (float): Number of standard deviations. Default is 1.

    Examples:
        >>> count_outside_nth_std = CountOutsideNthSTD(n=1.5)
        >>> count_outside_nth_std([1, 10, 15, 20, 100])
        1
    """

    name = "count_outside_nth_std"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, n=1):
        if n < 0:
            raise ValueError("n must be a positive number")

        self.n = n

    def get_function(self):
        def count_outside_nth_std(x):
            cond = np.abs(x - np.mean(x)) > np.std(x) * self.n
            return cond.sum()

        return count_outside_nth_std


================================================
FILE: featuretools/primitives/standard/aggregation/count_outside_range.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class CountOutsideRange(AggregationPrimitive):
    """Determines the number of values that fall outside a certain range.

    Args:
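        lower (float): Lower boundary of range. Values equal to
            this boundary are not counted as outside. Default is 0.
        upper (float): Upper boundary of range. Values equal to
            this boundary are not counted as outside. Default is 1.
        skipna (bool): Determines whether to skip NA/null values. Defaults to
            True. If False and the input contains NA/null, the result is NaN.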

    Examples:
        >>> count_outside_range = CountOutsideRange(lower=1.5, upper=3.6)
        >>> count_outside_range([1, 2, 3, 4, 5])
        3

        The way NaNs are treated can be controlled.

        >>> count_outside_range_skipna = CountOutsideRange(skipna=False)
        >>> count_outside_range_skipna([1, 2, 3, 4, 5, None])
        nan
    """

    name = "count_outside_range"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, lower=0, upper=1, skipna=True):
        self.lower = lower
        self.upper = upper
        self.skipna = skipna

    def get_function(self):
        def count_outside_range(x):
            if not self.skipna and x.isnull().values.any():
                return np.nan
            cond = (x < self.lower) | (x > self.upper)
            return cond.sum()

        return count_outside_range


================================================
FILE: featuretools/primitives/standard/aggregation/date_first_event.py
================================================
from pandas import NaT
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base import AggregationPrimitive


class DateFirstEvent(AggregationPrimitive):
    """Determines the first datetime from a list of datetimes.

    Examples:
        >>> from datetime import datetime
        >>> date_first_event = DateFirstEvent()
        >>> date_first_event([
        ...     datetime(2011, 4, 9, 10, 30, 10),
        ...     datetime(2011, 4, 9, 10, 30, 20),
        ...     datetime(2011, 4, 9, 10, 30, 30)])
        Timestamp('2011-04-09 10:30:10')
    """

    name = "date_first_event"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(logical_type=Datetime)
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def date_first_event(x):
            x = x.dropna()
            if x.empty:
                return NaT
            return x.iat[0]

        return date_first_event


================================================
FILE: featuretools/primitives/standard/aggregation/entropy.py
================================================
from scipy import stats
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Entropy(AggregationPrimitive):
    """Calculates the entropy for a categorical column.

    Description:
        Given a list of observations from a categorical
        column, return the entropy of the distribution.
        NaN values can be treated as a category or
        dropped.

    Args:
        dropna (bool): Whether to drop NaN values instead of treating
            them as a separate category. Defaults to False.
        base (float): The logarithmic base to use.
            Defaults to e (natural logarithm).

    Examples:
        >>> pd_entropy = Entropy()
        >>> pd_entropy([1, 2, 3, 4])
        1.3862943611198906
    """

    name = "entropy"
    input_types = [ColumnSchema(semantic_tags={"category"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    stack_on_self = False
    description_template = "the entropy of {}"

    def __init__(self, dropna=False, base=None):
        self.dropna = dropna
        self.base = base

    def get_function(self):
        def pd_entropy(s):
            distribution = s.value_counts(normalize=True, dropna=self.dropna)
            if distribution.dtype == "Float64":
                distribution = distribution.astype("float64")
            return stats.entropy(distribution.to_numpy(), base=self.base)

        return pd_entropy
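

# For readers who want to see the quantity being computed, here is a minimal
# standard-library sketch of Shannon entropy over category frequencies,
# H = -sum(p * log(p)). `entropy_sketch` is a hypothetical helper and is not
# how the primitive is implemented (it delegates to scipy.stats.entropy). The
# sketch treats only None as missing, which is a simplifying assumption.

```python
import math
from collections import Counter


def entropy_sketch(values, dropna=False, base=None):
    """Entropy of the empirical distribution of `values`."""
    if dropna:
        values = [v for v in values if v is not None]
    counts = Counter(values)
    total = sum(counts.values())
    result = -sum(
        (c / total) * math.log(c / total) for c in counts.values()
    )
    if base is not None:
        result /= math.log(base)  # change of logarithm base
    return result


natural = entropy_sketch([1, 2, 3, 4])
in_bits = entropy_sketch([1, 2, 3, 4], base=2)
```

# With four equally likely categories the result is log(4) in nats, or 2 bits.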


================================================
FILE: featuretools/primitives/standard/aggregation/first.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class First(AggregationPrimitive):
    """Determines the first value in a list.

    Examples:
        >>> first = First()
        >>> first([1, 2, 3, 4, 5, None])
        1.0
    """

    name = "first"
    input_types = [ColumnSchema()]
    return_type = None
    stack_on_self = False
    description_template = "the first instance of {}"

    def get_function(self):
        def pd_first(x):
            return x.iloc[0]

        return pd_first


================================================
FILE: featuretools/primitives/standard/aggregation/first_last_time_delta.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base import AggregationPrimitive


class FirstLastTimeDelta(AggregationPrimitive):
    """Determines the time between the first and last time value
    in seconds.

    Examples:
        >>> from datetime import datetime
        >>> first_last_time_delta = FirstLastTimeDelta()
        >>> first_last_time_delta([
        ...     datetime(2011, 4, 9, 10, 30, 0),
        ...     datetime(2011, 4, 9, 10, 30, 15),
        ...     datetime(2011, 4, 9, 10, 30, 35)])
        35.0
    """

    name = "first_last_time_delta"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_calc_time = False
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def first_last_time_delta(datetime_col):
            datetime_col = datetime_col.dropna()
            if datetime_col.empty:
                return np.nan
            delta = datetime_col.iloc[-1] - datetime_col.iloc[0]
            return delta.total_seconds()

        return first_last_time_delta


================================================
FILE: featuretools/primitives/standard/aggregation/has_no_duplicates.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base import AggregationPrimitive


class HasNoDuplicates(AggregationPrimitive):
    """Determines if there are no duplicates in the input.

    Args:
        skipna (bool): Determines whether to skip NA/null values.
            Defaults to True, which skips NA/null values.

    Examples:
        >>> has_no_duplicates = HasNoDuplicates()
        >>> has_no_duplicates([1, 1, 2])
        False
        >>> has_no_duplicates([1, 2, 3])
        True

        NaNs are skipped by default.

        >>> has_no_duplicates([1, 2, 3, None, None])
        True

        However, the way NaNs are treated can be controlled.

        >>> has_no_duplicates_skipna = HasNoDuplicates(skipna=False)
        >>> has_no_duplicates_skipna([1, 2, 3, None, None])
        False
        >>> has_no_duplicates_skipna([1, 2, 3, None])
        True
    """

    name = "has_no_duplicates"
    input_types = [
        [ColumnSchema(semantic_tags={"category"})],
        [ColumnSchema(semantic_tags={"numeric"})],
    ]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    stack_on_self = False
    default_value = True

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def has_no_duplicates(data):
            if self.skipna:
                data = data.dropna()
            return not data.duplicated().any()

        return has_no_duplicates


================================================
FILE: featuretools/primitives/standard/aggregation/is_monotonically_decreasing.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base import AggregationPrimitive


class IsMonotonicallyDecreasing(AggregationPrimitive):
    """Determines if a series is monotonically decreasing.

    Description:
        Given a list of numeric values, return True if the
        values are monotonically decreasing, meaning each value
        is less than or equal to the one before it. If the series
        contains `NaN` values, they will be skipped.

    Examples:
        >>> is_monotonically_decreasing = IsMonotonicallyDecreasing()
        >>> is_monotonically_decreasing([9, 5, 3, 1])
        True
    """

    name = "is_monotonically_decreasing"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    stack_on_self = False
    default_value = False

    def get_function(self):
        def is_monotonically_decreasing(x):
            return x.dropna().is_monotonic_decreasing

        return is_monotonically_decreasing
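
# A minimal sketch, not part of featuretools: pandas' `is_monotonic_decreasing`
# accepts repeated values, so a strict check needs an extra step. The
# `diff()`-based strict test below is one assumed way to write it.

```python
import pandas as pd

values = pd.Series([9, 5, 5, 1])

# Non-strict check: the same Series property the primitive above relies on.
non_strict = bool(values.is_monotonic_decreasing)

# Strict check: every successive difference must be negative.
strict = bool((values.diff().dropna() < 0).all())

print(non_strict, strict)  # prints "True False"
```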


================================================
FILE: featuretools/primitives/standard/aggregation/is_monotonically_increasing.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base import AggregationPrimitive


class IsMonotonicallyIncreasing(AggregationPrimitive):
    """Determines if a series is monotonically increasing.

    Description:
        Given a list of numeric values, return True if the
        values never decrease (repeated values are allowed).
        If the series contains `NaN` values, they will be skipped.

    Examples:
        >>> is_monotonically_increasing = IsMonotonicallyIncreasing()
        >>> is_monotonically_increasing([1, 3, 5, 9])
        True
    """

    name = "is_monotonically_increasing"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    stack_on_self = False
    default_value = False

    def get_function(self):
        def is_monotonically_increasing(x):
            return x.dropna().is_monotonic_increasing

        return is_monotonically_increasing


================================================
FILE: featuretools/primitives/standard/aggregation/is_unique.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base import AggregationPrimitive


class IsUnique(AggregationPrimitive):
    """Determines whether or not a series of discrete values is all unique.

    Description:
        Given a series of discrete values, return True if each
        value in the series is unique. If any value is repeated,
        return False.

    Examples:
        >>> is_unique = IsUnique()
        >>> is_unique(['red', 'blue', 'green', 'yellow'])
        True

        If the series is not unique, return False

        >>> is_unique = IsUnique()
        >>> is_unique(['red', 'blue', 'green', 'blue'])
        False
    """

    name = "is_unique"
    input_types = [ColumnSchema(semantic_tags={"category"})]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    stack_on_self = False
    default_value = False

    def get_function(self):
        def is_unique(x):
            return x.is_unique

        return is_unique


================================================
FILE: featuretools/primitives/standard/aggregation/kurtosis.py
================================================
from scipy.stats import kurtosis
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double, Integer

from featuretools.primitives.base import AggregationPrimitive


class Kurtosis(AggregationPrimitive):
    """Calculates the kurtosis for a list of numbers

    Args:
        fisher (bool): Optional. If True, Fisher's definition is used
            (normal ==> 0.0). If False, Pearson's definition is used
            (normal ==> 3.0). Default is True.
        bias (bool): Optional. If False, then the calculations are
            corrected for statistical bias. Default is True.
        nan_policy (str): Optional. Defines how to handle input
            that contains NaN. Possible values include
            `['propagate', 'raise', 'omit']`. 'propagate'
            returns NaN, 'raise' throws an error, 'omit'
            performs the calculations ignoring NaN values.
            Default is 'propagate'.

    Examples:
        >>> kurtosis = Kurtosis()
        >>> kurtosis([1, 2, 3, 4, 5])
        -1.3

        You can use Pearson's definition by setting the 'fisher' argument to False

        >>> kurtosis_fisher = Kurtosis(fisher=False)
        >>> kurtosis_fisher([1, 2, 3, 4, 5])
        1.7

        You can correct for statistical bias by setting the 'bias' argument to False

        >>> kurtosis_bias = Kurtosis(bias=False)
        >>> kurtosis_bias([1, 2, 3, 4, 5])
        -1.2000000000000004

        You can specify how to handle NaN values in the input with the 'nan_policy'
        argument

        >>> kurtosis_nan_policy = Kurtosis(nan_policy='omit')
        >>> kurtosis_nan_policy([1, 2, None, 3, 4, 5])
        -1.3
    """

    name = "kurtosis"
    input_types = [
        [ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})],
        [ColumnSchema(logical_type=Double, semantic_tags={"numeric"})],
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, fisher=True, bias=True, nan_policy="propagate"):
        if nan_policy not in ["propagate", "raise", "omit"]:
            raise ValueError("Invalid nan_policy")
        self.fisher = fisher
        self.bias = bias
        self.nan_policy = nan_policy

    def get_function(self):
        def kurtosis_func(x):
            return kurtosis(
                x,
                axis=0,
                fisher=self.fisher,
                bias=self.bias,
                nan_policy=self.nan_policy,
            )

        return kurtosis_func


================================================
FILE: featuretools/primitives/standard/aggregation/last.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Last(AggregationPrimitive):
    """Determines the last value in a list.

    Examples:
        >>> last = Last()
        >>> last([1, 2, 3, 4, 5, None])
        nan
    """

    name = "last"
    input_types = [ColumnSchema()]
    return_type = None
    stack_on_self = False
    description_template = "the last instance of {}"

    def get_function(self):
        def pd_last(x):
            return x.iloc[-1]

        return pd_last


================================================
FILE: featuretools/primitives/standard/aggregation/max_consecutive_false.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, Integer

from featuretools.primitives.base import AggregationPrimitive


class MaxConsecutiveFalse(AggregationPrimitive):
    """Determines the maximum number of consecutive False values in the input

    Examples:
        >>> max_consecutive_false = MaxConsecutiveFalse()
        >>> max_consecutive_false([True, False, False, True, True, False])
        2
    """

    name = "max_consecutive_false"
    input_types = [ColumnSchema(logical_type=Boolean)]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def max_consecutive_false(x):
            # invert the input array to work properly with the computation
            x[x.notnull()] = ~(x[x.notnull()].astype(bool))
            # find the locations where the value changes from the previous value
            not_equal = x != x.shift()
            # Use cumulative sum to label runs of identical values. The sum increments
            # each time the value changes and remains unchanged within a run.
            not_equal_sum = not_equal.cumsum()
            # group the input by the cumulative sum values and use cumulative count
            # to count the number of consecutive values. Add 1 because the cumulative
            # count starts at zero within each run
            consecutive = x.groupby(not_equal_sum).cumcount() + 1
            # multiply by the inverted input to keep only the counts that correspond to
            # false values
            consecutive_false = consecutive * x
            # return the max of all the consecutive false values
            return consecutive_false.max()

        return max_consecutive_false


================================================
FILE: featuretools/primitives/standard/aggregation/max_consecutive_negatives.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double, Integer

from featuretools.primitives.base import AggregationPrimitive


class MaxConsecutiveNegatives(AggregationPrimitive):
    """Determines the maximum number of consecutive negative values in the input

    Args:
        skipna (bool): Ignore any `NaN` values in the input. Default is True.

    Examples:
        >>> max_consecutive_negatives = MaxConsecutiveNegatives()
        >>> max_consecutive_negatives([1.0, -1.4, -2.4, -5.4, 2.9, -4.3])
        3

        `NaN` values are skipped by default. With `skipna=False`, a `NaN` ends the current run

        >>> max_consecutive_negatives_skipna = MaxConsecutiveNegatives(skipna=False)
        >>> max_consecutive_negatives_skipna([1.0, 1.4, -2.4, None, -2.9, -4.3])
        2
    """

    name = "max_consecutive_negatives"
    input_types = [
        [ColumnSchema(logical_type=Integer)],
        [ColumnSchema(logical_type=Double)],
    ]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def max_consecutive_negatives(x):
            if self.skipna:
                x = x.dropna()
            # convert the numeric values to booleans for processing
            x[x.notnull()] = x[x.notnull()].lt(0)
            # find the locations where the value changes from the previous value
            not_equal = x != x.shift()
            # Use cumulative sum to label runs of identical values. The sum increments
            # each time the value changes and remains unchanged within a run.
            not_equal_sum = not_equal.cumsum()
            # group the input by the cumulative sum values and use cumulative count
            # to count the number of consecutive values. Add 1 because the cumulative
            # count starts at zero within each run
            consecutive = x.groupby(not_equal_sum).cumcount() + 1
            # multiply by the boolean input to keep only the counts that correspond to
            # negative values
            consecutive_neg = consecutive * x
            # return the max of all the consecutive negative values
            return consecutive_neg.max()

        return max_consecutive_negatives


================================================
FILE: featuretools/primitives/standard/aggregation/max_consecutive_positives.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double, Integer

from featuretools.primitives.base import AggregationPrimitive


class MaxConsecutivePositives(AggregationPrimitive):
    """Determines the maximum number of consecutive positive values in the input

    Args:
        skipna (bool): Ignore any `NaN` values in the input. Default is True.

    Examples:
        >>> max_consecutive_positives = MaxConsecutivePositives()
        >>> max_consecutive_positives([1.0, -1.4, 2.4, 5.4, 2.9, -4.3])
        3

        `NaN` values are skipped by default. With `skipna=False`, a `NaN` ends the current run

        >>> max_consecutive_positives_skipna = MaxConsecutivePositives(skipna=False)
        >>> max_consecutive_positives_skipna([1.0, -1.4, 2.4, None, 2.9, 4.3])
        2
    """

    name = "max_consecutive_positives"
    input_types = [
        [ColumnSchema(logical_type=Integer)],
        [ColumnSchema(logical_type=Double)],
    ]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def max_consecutive_positives(x):
            if self.skipna:
                x = x.dropna()
            # convert the numeric values to booleans for processing
            x[x.notnull()] = x[x.notnull()].gt(0)
            # find the locations where the value changes from the previous value
            not_equal = x != x.shift()
            # Use cumulative sum to label runs of identical values. The sum increments
            # each time the value changes and remains unchanged within a run.
            not_equal_sum = not_equal.cumsum()
            # group the input by the cumulative sum values and use cumulative count
            # to count the number of consecutive values. Add 1 because the cumulative
            # count starts at zero within each run
            consecutive = x.groupby(not_equal_sum).cumcount() + 1
            # multiply by the boolean input to keep only the counts that correspond to
            # positive values
            consecutive_pos = consecutive * x
            # return the max of all the consecutive positive values
            return consecutive_pos.max()

        return max_consecutive_positives


================================================
FILE: featuretools/primitives/standard/aggregation/max_consecutive_true.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, Integer

from featuretools.primitives.base import AggregationPrimitive


class MaxConsecutiveTrue(AggregationPrimitive):
    """Determines the maximum number of consecutive True values in the input

    Examples:
        >>> max_consecutive_true = MaxConsecutiveTrue()
        >>> max_consecutive_true([True, False, True, True, True, False])
        3
    """

    name = "max_consecutive_true"
    input_types = [ColumnSchema(logical_type=Boolean)]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def max_consecutive_true(x):
            # find the locations where the value changes from the previous value
            not_equal = x != x.shift()
            # use cumulative sum to label runs of identical values. The sum increments
            # each time the value changes and remains unchanged within a run.
            not_equal_sum = not_equal.cumsum()
            # group the input by the cumulative sum values and use cumulative count
            # to count the number of consecutive values. Add 1 because the cumulative
            # count starts at zero within each run
            consecutive = x.groupby(not_equal_sum).cumcount() + 1
            # multiply by the original input to keep only the counts that correspond to
            # true values
            consecutive_true = consecutive * x
            # return the max of all the consecutive true values
            return consecutive_true.max()

        return max_consecutive_true
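
# A minimal standalone sketch of the run-length technique used above, outside of
# the primitive class. `longest_true_run` is a hypothetical helper name, and the
# empty-input guard is an added assumption, not behavior of MaxConsecutiveTrue.

```python
import pandas as pd


def longest_true_run(flags: pd.Series) -> int:
    """Return the length of the longest run of True values."""
    if flags.empty:
        return 0
    flags = flags.astype(bool)
    # Label each run: the label increments whenever the value changes.
    run_id = (flags != flags.shift()).cumsum()
    # Position of each element within its run, starting at 1.
    run_position = flags.groupby(run_id).cumcount() + 1
    # Zero out positions that belong to runs of False, then take the maximum.
    return int((run_position * flags).max())


print(longest_true_run(pd.Series([True, False, True, True, True, False])))
```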


================================================
FILE: featuretools/primitives/standard/aggregation/max_consecutive_zeros.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double, Integer

from featuretools.primitives.base import AggregationPrimitive


class MaxConsecutiveZeros(AggregationPrimitive):
    """Determines the maximum number of consecutive zero values in the input

    Args:
        skipna (bool): Ignore any `NaN` values in the input. Default is True.

    Examples:
        >>> max_consecutive_zeros = MaxConsecutiveZeros()
        >>> max_consecutive_zeros([1.0, -1.4, 0, 0.0, 0, -4.3])
        3

        `NaN` values are skipped by default. With `skipna=False`, a `NaN` ends the current run

        >>> max_consecutive_zeros_skipna = MaxConsecutiveZeros(skipna=False)
        >>> max_consecutive_zeros_skipna([1.0, -1.4, 0, None, 0.0, -4.3])
        1
    """

    name = "max_consecutive_zeros"
    input_types = [
        [ColumnSchema(logical_type=Integer)],
        [ColumnSchema(logical_type=Double)],
    ]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def max_consecutive_zeros(x):
            if self.skipna:
                x = x.dropna()
            # convert the numeric values to booleans for processing
            x[x.notnull()] = x[x.notnull()].eq(0)
            # find the locations where the value changes from the previous value
            not_equal = x != x.shift()
            # Use cumulative sum to label runs of identical values. The sum increments
            # each time the value changes and remains unchanged within a run.
            not_equal_sum = not_equal.cumsum()
            # group the input by the cumulative sum values and use cumulative count
            # to count the number of consecutive values. Add 1 because the cumulative
            # count starts at zero within each run
            consecutive = x.groupby(not_equal_sum).cumcount() + 1
            # multiply by the boolean input to keep only the counts that correspond to
            # zero values
            consecutive_zero = consecutive * x
            # return the max of all the consecutive zero values
            return consecutive_zero.max()

        return max_consecutive_zeros


================================================
FILE: featuretools/primitives/standard/aggregation/max_count.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import AggregationPrimitive


class MaxCount(AggregationPrimitive):
    """Calculates the number of occurrences of the max value in a list

    Args:
        skipna (bool): Whether to skip NA/null values. Defaults to
            True, which skips them. If skipna is False and there are NaN
            values in the array, the max will be NaN regardless of
            the other values, and NaN will be returned.

    Examples:
        >>> max_count = MaxCount()
        >>> max_count([1, 2, 5, 1, 5, 3, 5])
        3

        You can optionally specify how to handle NaN values

        >>> max_count_skipna = MaxCount(skipna=False)
        >>> max_count_skipna([1, 2, 5, 1, 5, 3, None])
        nan
    """

    name = "max_count"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def max_count(x):
            xmax = x.max(skipna=self.skipna)
            if np.isnan(xmax):
                return np.nan
            return x.eq(xmax).sum()

        return max_count


================================================
FILE: featuretools/primitives/standard/aggregation/max_min_delta.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import AggregationPrimitive


class MaxMinDelta(AggregationPrimitive):
    """Determines the difference between the max and min value.

    Args:
        skipna (bool): Whether to skip NA/null values. Defaults to
            True, which skips them.

    Examples:
        >>> max_min_delta = MaxMinDelta()
        >>> max_min_delta([7, 2, 5, 3, 10])
        8

        You can optionally specify how to handle NaN values

        >>> max_min_delta_skipna = MaxMinDelta(skipna=False)
        >>> max_min_delta_skipna([7, 2, None, 3, 10])
        nan
    """

    name = "max_min_delta"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def max_min_delta(x):
            max_val = x.max(skipna=self.skipna)
            min_val = x.min(skipna=self.skipna)
            return max_val - min_val

        return max_min_delta


================================================
FILE: featuretools/primitives/standard/aggregation/max_primitive.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Max(AggregationPrimitive):
    """Calculates the highest value, ignoring `NaN` values.

    Examples:
        >>> max = Max()
        >>> max([1, 2, 3, 4, 5, None])
        5.0
    """

    name = "max"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    stack_on_self = False
    description_template = "the maximum of {}"

    def get_function(self):
        return np.max


================================================
FILE: featuretools/primitives/standard/aggregation/mean.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Mean(AggregationPrimitive):
    """Computes the average for a list of values.

    Args:
        skipna (bool): Whether to skip NA/null values. Defaults to
            True, which skips them.

    Examples:
        >>> mean = Mean()
        >>> mean([1, 2, 3, 4, 5, None])
        3.0

        We can also control the way `NaN` values are handled.

        >>> mean = Mean(skipna=False)
        >>> mean([1, 2, 3, 4, 5, None])
        nan
    """

    name = "mean"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the average of {}"

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        if self.skipna:
            # np.mean of series is functionally nanmean
            return np.mean

        def mean(series):
            return np.mean(series.values)

        return mean
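
# A minimal sketch, not part of featuretools, of why the two branches above differ:
# np.mean on a pandas Series defers to Series.mean (which skips NaN by default),
# while np.mean on the underlying ndarray propagates NaN.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5, None])

# On a Series, np.mean dispatches to Series.mean, so the NaN is skipped.
skipping = np.mean(series)

# On the raw ndarray, the NaN propagates into the result.
propagating = np.mean(series.values)

print(skipping, propagating)
```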


================================================
FILE: featuretools/primitives/standard/aggregation/median.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Median(AggregationPrimitive):
    """Determines the middlemost number in a list of values.

    Examples:
        >>> median = Median()
        >>> median([5, 3, 2, 1, 4])
        3.0

        `NaN` values are ignored.

        >>> median([5, 3, 2, 1, 4, None])
        3.0
    """

    name = "median"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the median of {}"

    def get_function(self):
        return pd.Series.median


================================================
FILE: featuretools/primitives/standard/aggregation/median_count.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base import AggregationPrimitive


class MedianCount(AggregationPrimitive):
    """Calculates the number of occurrences of the median value in a list

    Args:
        skipna (bool): Whether to skip NA/null values. Defaults to
            True, which skips them. If skipna is False and there are NaN
            values in the array, the median will be NaN regardless of
            the other values, and NaN will be returned.

    Examples:
        >>> median_count = MedianCount()
        >>> median_count([1, 2, 3, 1, 5, 3, 5])
        2

        You can optionally specify how to handle NaN values

        >>> median_count_skipna = MedianCount(skipna=False)
        >>> median_count_skipna([1, 2, 3, 1, 5, 3, None])
        nan
    """

    name = "median_count"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def median_count(x):
            median = x.median(skipna=self.skipna)
            if np.isnan(median):
                return np.nan
            return x.eq(median).sum()

        return median_count


================================================
FILE: featuretools/primitives/standard/aggregation/min_count.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base import AggregationPrimitive


class MinCount(AggregationPrimitive):
    """Calculates the number of occurrences of the min value in a list

    Args:
        skipna (bool): Whether to skip NA/null values. Defaults to
            True, which skips them. If skipna is False and there are NaN
            values in the array, the min will be NaN regardless of
            the other values, and NaN will be returned.

    Examples:
        >>> min_count = MinCount()
        >>> min_count([1, 2, 5, 1, 5, 3, 5])
        2

        You can optionally specify how to handle NaN values

        >>> min_count_skipna = MinCount(skipna=False)
        >>> min_count_skipna([1, 2, 5, 1, 5, 3, None])
        nan
    """

    name = "min_count"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def min_count(x):
            xmin = x.min(skipna=self.skipna)
            if np.isnan(xmin):
                return np.nan
            return x.eq(xmin).sum()

        return min_count


================================================
FILE: featuretools/primitives/standard/aggregation/min_primitive.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Min(AggregationPrimitive):
    """Calculates the smallest value, ignoring `NaN` values.

    Examples:
        >>> min = Min()
        >>> min([1, 2, 3, 4, 5, None])
        1.0
    """

    name = "min"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    stack_on_self = False
    description_template = "the minimum of {}"

    def get_function(self):
        return np.min


================================================
FILE: featuretools/primitives/standard/aggregation/mode.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Mode(AggregationPrimitive):
    """Determines the most commonly repeated value.

    Description:
        Given a list of values, return the value with the
        highest number of occurrences. If the list is
        empty, return `NaN`.

    Examples:
        >>> mode = Mode()
        >>> mode(['red', 'blue', 'green', 'blue'])
        'blue'
    """

    name = "mode"
    input_types = [ColumnSchema(semantic_tags={"category"})]
    return_type = None
    description_template = "the most frequently occurring value of {}"

    def get_function(self):
        def pd_mode(s):
            return s.mode().get(0, np.nan)

        return pd_mode


================================================
FILE: featuretools/primitives/standard/aggregation/n_most_common.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class NMostCommon(AggregationPrimitive):
    """Determines the `n` most common elements.

    Description:
        Given a list of values, return the `n` values
        which appear the most frequently. If there are
        fewer than `n` unique values, the output will be
        filled with `NaN`.

    Args:
        n (int): defines "n" in "n most common." Defaults
            to 3.

    Examples:
        >>> n_most_common = NMostCommon(n=2)
        >>> x = ['orange', 'apple', 'orange', 'apple', 'orange', 'grapefruit']
        >>> n_most_common(x).tolist()
        ['orange', 'apple']
    """

    name = "n_most_common"
    input_types = [ColumnSchema(semantic_tags={"category"})]
    return_type = None

    def __init__(self, n=3):
        self.n = n
        self.number_output_features = n
        self.description_template = [
            "the {} most common values of {{}}".format(n),
            "the most common value of {}",
            *["the {nth_slice} most common value of {}"] * (n - 1),
        ]

    def get_function(self):
        def n_most_common(x):
            # Counts of 0 remain in value_counts output if dtype is category
            # so we need to remove them
            counts = x.value_counts()
            counts = counts[counts > 0]
            array = np.array(counts.index[: self.n])
            if len(array) < self.n:
                filler = np.full(self.n - len(array), np.nan)
                array = np.append(array, filler)
            return array

        return n_most_common


================================================
FILE: featuretools/primitives/standard/aggregation/n_most_common_frequency.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical

from featuretools.primitives.base import AggregationPrimitive


class NMostCommonFrequency(AggregationPrimitive):
    """Determines the frequency of the n most common items.

    Args:
        n (int): defines "n" in "n most common". Defaults to 3.
        skipna (bool): Whether to skip NA/null values. Defaults to
            True, which skips them.

    Description:
        Given a list, find the n most common items, and return a series
        showing the frequency of each item. If the list has fewer than n unique
        values, the resulting series will be padded with `NaN`.

    Examples:
        >>> n_most_common_frequency = NMostCommonFrequency()
        >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4]).to_list()
        [3, 2, 2]

        We can increase n to include more items.

        >>> n_most_common_frequency = NMostCommonFrequency(4)
        >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4]).to_list()
        [3, 2, 2, 1]

        NaNs are skipped by default.

        >>> n_most_common_frequency = NMostCommonFrequency(3)
        >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4, None, None, None]).to_list()
        [3, 2, 2]

        However, the way NaNs are treated can be controlled.

        >>> n_most_common_frequency = NMostCommonFrequency(3, skipna=False)
        >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4, None, None, None]).to_list()
        [3, 3, 2]
    """

    name = "n_most_common_frequency"
    input_types = [ColumnSchema(semantic_tags={"category"})]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def __init__(self, n=3, skipna=True):
        self.n = n
        self.number_output_features = n
        self.skipna = skipna

    def get_function(self):
        def n_most_common_frequency(data, n=self.n):
            frequencies = data.value_counts(dropna=self.skipna)
            n_most_common = frequencies.iloc[0:n]
            nan_add = n - frequencies.shape[0]
            if nan_add > 0:
                n_most_common = pd.concat(
                    [n_most_common, pd.Series([np.nan] * nan_add)],
                )
            return n_most_common

        return n_most_common_frequency


================================================
FILE: featuretools/primitives/standard/aggregation/n_unique_days.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Integer

from featuretools.primitives.base import AggregationPrimitive


class NUniqueDays(AggregationPrimitive):
    """Determines the number of unique days.

    Description:
        Given a list of datetimes, return the number of unique days.
        The same day in two different years is treated as different. So
        Feb 21, 2017 is different than Feb 21, 2019, even though they are
        both the 21st of February.

    Examples:
        >>> from datetime import datetime
        >>> n_unique_days = NUniqueDays()
        >>> times = [datetime(2019, 2, 1),
        ...          datetime(2019, 2, 1),
        ...          datetime(2018, 2, 1),
        ...          datetime(2019, 1, 1)]
        >>> n_unique_days(times)
        3
    """

    name = "n_unique_days"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def n_unique_days(x):
            return x.dt.floor("D").nunique()

        return n_unique_days


================================================
FILE: featuretools/primitives/standard/aggregation/n_unique_days_of_calendar_year.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Integer

from featuretools.primitives.base import AggregationPrimitive


class NUniqueDaysOfCalendarYear(AggregationPrimitive):
    """Determines the number of unique calendar days.

    Description:
        Given a list of datetimes, return the number of unique calendar
        days. The same date in two different years is counted as one. So
        Feb 21, 2017 is counted as the same day as Feb 21, 2019.

    Examples:
        >>> from datetime import datetime
        >>> n_unique_days_of_calendar_year = NUniqueDaysOfCalendarYear()
        >>> times = [datetime(2019, 2, 1),
        ...          datetime(2019, 2, 1),
        ...          datetime(2018, 2, 1),
        ...          datetime(2019, 1, 1)]
        >>> n_unique_days_of_calendar_year(times)
        2
    """

    name = "n_unique_days_of_calendar_year"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def n_unique_days_of_calendar_year(x):
            return x.dropna().dt.strftime("%m-%d").nunique()

        return n_unique_days_of_calendar_year


================================================
FILE: featuretools/primitives/standard/aggregation/n_unique_days_of_month.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Integer

from featuretools.primitives.base import AggregationPrimitive


class NUniqueDaysOfMonth(AggregationPrimitive):
    """Determines the number of unique days of month.

    Description:
        Given a list of datetimes, return the number of unique days
        of month. The maximum value is 31. 2018-01-01 and 2018-02-01
        will be counted as 1 unique day. 2019-01-01 and 2018-01-01
        will also be counted as 1.

    Examples:
        >>> from datetime import datetime
        >>> n_unique_days_of_month = NUniqueDaysOfMonth()
        >>> times = [datetime(2019, 1, 1),
        ...          datetime(2019, 2, 1),
        ...          datetime(2018, 2, 1),
        ...          datetime(2019, 1, 2),
        ...          datetime(2019, 1, 3)]
        >>> n_unique_days_of_month(times)
        3
    """

    name = "n_unique_days_of_month"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def n_unique_days_of_month(x):
            return x.dropna().dt.day.nunique()

        return n_unique_days_of_month


================================================
FILE: featuretools/primitives/standard/aggregation/n_unique_months.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Integer

from featuretools.primitives.base import AggregationPrimitive


class NUniqueMonths(AggregationPrimitive):
    """Determines the number of unique months.

    Description:
        Given a list of datetimes, return the number of unique months.
        NUniqueMonths counts absolute month, not month of year, so the
        same month in two different years is treated as different. (i.e.
        Feb 2017 is different than Feb 2019.)

    Examples:
        >>> from datetime import datetime
        >>> n_unique_months = NUniqueMonths()
        >>> times = [datetime(2019, 1, 1),
        ...          datetime(2019, 1, 2),
        ...          datetime(2019, 1, 3),
        ...          datetime(2019, 2, 1),
        ...          datetime(2018, 2, 1)]
        >>> n_unique_months(times)
        3
    """

    name = "n_unique_months"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def n_unique_months(x):
            return x.dt.to_period("M").nunique()

        return n_unique_months


================================================
FILE: featuretools/primitives/standard/aggregation/n_unique_weeks.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Integer

from featuretools.primitives.base import AggregationPrimitive


class NUniqueWeeks(AggregationPrimitive):
    """Determines the number of unique weeks.

    Description:
        Given a list of datetimes, return the number of unique
        weeks (Monday-Sunday). NUniqueWeeks counts by absolute
        week, not week of year, so the first week of 2018 and
        the first week of 2019 count as two unique values.

    Examples:
        >>> from datetime import datetime
        >>> n_unique_weeks = NUniqueWeeks()
        >>> times = [datetime(2018, 2, 2),
        ...          datetime(2019, 1, 1),
        ...          datetime(2019, 2, 1),
        ...          datetime(2019, 2, 1),
        ...          datetime(2019, 2, 3),
        ...          datetime(2019, 2, 21)]
        >>> n_unique_weeks(times)
        4
    """

    name = "n_unique_weeks"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def n_unique_weeks(x):
            return x.dt.to_period("W").nunique()

        return n_unique_weeks


================================================
FILE: featuretools/primitives/standard/aggregation/num_consecutive_greater_mean.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base import AggregationPrimitive


class NumConsecutiveGreaterMean(AggregationPrimitive):
    """Determines the length of the longest subsequence above the mean.

    Description:
        Given a list of numbers, find the longest run of consecutive numbers
        larger than the mean of the entire sequence. Return the length
        of that run.

    Args:
        skipna (bool): If this is False and any value in x is `NaN`, then
            the result will be `NaN`. If True, `NaN` values are skipped.
            Default is True.

    Examples:
        >>> num_consecutive_greater_mean = NumConsecutiveGreaterMean()
        >>> num_consecutive_greater_mean([1, 2, 3, 4, 5, 6])
        3.0

        We can also control the way `NaN` values are handled.

        >>> num_consecutive_greater_mean = NumConsecutiveGreaterMean(skipna=False)
        >>> num_consecutive_greater_mean([1, 2, 3, 4, 5, 6, None])
        nan
    """

    name = "num_consecutive_greater_mean"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def num_consecutive_greater_mean(x):
            # check for NaN cases
            if x.isnull().all():
                return np.nan
            if not self.skipna and x.isnull().values.any():
                return np.nan
            x_mean = x.mean()

            # In some cases, the mean of x may be NaN
            #   (such as when x has both inf and -inf values)
            if np.isnan(x_mean):
                return np.nan

            # Find indices of points at or below mean
            x = x.dropna().reset_index(drop=True)
            below_mean_indices = x[x <= x_mean].index.to_series()

            # If none of x is below the mean, return the length of x
            if below_mean_indices.empty:
                return len(x)

            # Pad index with start/end values, in case the longest
            #   sequence occurs at the beginning or end of x
            below_mean_indices[-1] = -1
            below_mean_indices[len(x)] = len(x)
            below_mean_indices = below_mean_indices.sort_index()

            # Calculate gaps between points below mean
            below_mean_indices_shifted = below_mean_indices.shift(1)
            diffs = below_mean_indices - below_mean_indices_shifted

            # Take biggest gap, and subtract 1 to get result
            max_gap = (diffs).max() - 1
            return max_gap

        return num_consecutive_greater_mean


================================================
FILE: featuretools/primitives/standard/aggregation/num_consecutive_less_mean.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base import AggregationPrimitive


class NumConsecutiveLessMean(AggregationPrimitive):
    """Determines the length of the longest subsequence below the mean.

    Description:
        Given a list of numbers, find the longest run of consecutive numbers
        smaller than the mean of the entire sequence. Return the length
        of that run.

    Args:
        skipna (bool): If this is False and any value in x is `NaN`, then
            the result will be `NaN`. If True, `NaN` values are skipped.
            Default is True.

    Examples:
        >>> num_consecutive_less_mean = NumConsecutiveLessMean()
        >>> num_consecutive_less_mean([1, 2, 3, 4, 5, 6])
        3.0

        We can also control the way `NaN` values are handled.

        >>> num_consecutive_less_mean = NumConsecutiveLessMean(skipna=False)
        >>> num_consecutive_less_mean([1, 2, 3, 4, 5, 6, None])
        nan
    """

    name = "num_consecutive_less_mean"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def num_consecutive_less_mean(x):
            # check for NaN cases
            if x.isnull().all():
                return np.nan
            if not self.skipna and x.isnull().values.any():
                return np.nan
            x_mean = x.mean()

            # In some cases, the mean of x may be NaN
            #   (such as when x has both inf and -inf values)
            if np.isnan(x_mean):
                return np.nan

            # Find indices of points at or above mean
            x = x.dropna().reset_index(drop=True)
            above_mean_indices = x[x >= x_mean].index.to_series()

            # If none of x is above the mean, return the length of x
            if above_mean_indices.empty:
                return len(x)

            # Pad index with start/end values, in case the longest
            #   sequence occurs at the beginning or end of x
            above_mean_indices[-1] = -1
            above_mean_indices[len(x)] = len(x)
            above_mean_indices = above_mean_indices.sort_index()

            # Calculate gaps between points above mean
            above_mean_indices_shifted = above_mean_indices.shift(1)
            diffs = above_mean_indices - above_mean_indices_shifted

            # Take biggest gap, and subtract 1 to get result
            max_gap = (diffs).max() - 1
            return max_gap

        return num_consecutive_less_mean


================================================
FILE: featuretools/primitives/standard/aggregation/num_false_since_last_true.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, IntegerNullable

from featuretools.primitives.base import AggregationPrimitive


class NumFalseSinceLastTrue(AggregationPrimitive):
    """Calculates the number of `False` values since the last `True` value.

    Description:
        From a series of Booleans, find the last record with a `True` value.
        Return the count of `False` values between that record and the end of
        the series. Return nan if no values are `True`. Any nan values in the
        input are ignored. A `True` value in the last row will result in a
        count of 0. Inputs are converted to booleans before calculating
        the result.

    Examples:
        >>> num_false_since_last_true = NumFalseSinceLastTrue()
        >>> num_false_since_last_true([True, False, True, False, False])
        2
    """

    name = "num_false_since_last_true"
    input_types = [ColumnSchema(logical_type=Boolean)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def num_false_since_last_true(x):
            if x.empty:
                return np.nan
            x = x.dropna().astype(bool)
            true_indices = x[x]
            if true_indices.empty:
                return np.nan
            last_true_index = true_indices.index[-1]
            x_slice = x.loc[last_true_index:]
            return np.invert(x_slice).sum()

        return num_false_since_last_true


================================================
FILE: featuretools/primitives/standard/aggregation/num_peaks.py
================================================
import pandas as pd
from scipy.signal import find_peaks
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Integer

from featuretools.primitives.base import AggregationPrimitive


class NumPeaks(AggregationPrimitive):
    """Determines the number of peaks in a list of numbers.

    Description:
        Given a list of numbers, count the number of local
        maxima. Uses the find_peaks function from scipy.signal.

    Examples:
        >>> num_peaks = NumPeaks()
        >>> num_peaks([-5, 0, 10, 0, 10, -5, -4, -5, 10, 0])
        4
    """

    name = "num_peaks"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def num_peaks(x):
            if x.dtype == "Int64":
                x = x.astype("float64")
            peaks = find_peaks(x)[0]
            return len(peaks[~pd.isna(peaks)])

        return num_peaks


================================================
FILE: featuretools/primitives/standard/aggregation/num_true.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable, IntegerNullable

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class NumTrue(AggregationPrimitive):
    """Counts the number of `True` values.

    Description:
        Given a list of booleans, return the number
        of `True` values. Ignores `NaN`.

    Examples:
        >>> num_true = NumTrue()
        >>> num_true([True, False, True, True, None])
        3
    """

    name = "num_true"
    input_types = [
        [ColumnSchema(logical_type=Boolean)],
        [ColumnSchema(logical_type=BooleanNullable)],
    ]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    default_value = 0
    stack_on = []
    stack_on_exclude = []
    description_template = "the number of times {} is true"

    def get_function(self):
        return np.sum


================================================
FILE: featuretools/primitives/standard/aggregation/num_true_since_last_false.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, IntegerNullable

from featuretools.primitives.base import AggregationPrimitive


class NumTrueSinceLastFalse(AggregationPrimitive):
    """Calculates the number of `True` values since the last `False` value.

    Description:
        From a series of Booleans, find the last record with a `False` value.
        Return the count of `True` values between that record and the end of
        the series. Return nan if no values are `False`. Any nan values in the
        input are ignored. A `False` value in the last row will result in a
        count of 0.

    Examples:
        >>> num_true_since_last_false = NumTrueSinceLastFalse()
        >>> num_true_since_last_false([False, True, False, True, True])
        2
    """

    name = "num_true_since_last_false"
    input_types = [ColumnSchema(logical_type=Boolean)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def num_true_since_last_false(x):
            x = x.dropna().astype(bool)
            false_indices = x[~x]
            if false_indices.empty:
                return np.nan
            last_false_index = false_indices.index[-1]
            x_slice = x.loc[last_false_index:]
            return x_slice.sum()

        return num_true_since_last_false


================================================
FILE: featuretools/primitives/standard/aggregation/num_unique.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class NumUnique(AggregationPrimitive):
    """Determines the number of distinct values, ignoring `NaN` values.

    Args:
        use_string_for_pd_calc (bool): Determines if the string 'nunique' or the function
            pd.Series.nunique is used for making the primitive calculation. Put in place to
            account for the bug https://github.com/pandas-dev/pandas/issues/57317.
            Defaults to using the string.

    Examples:
        >>> num_unique = NumUnique(use_string_for_pd_calc=False)
        >>> num_unique(['red', 'blue', 'green', 'yellow'])
        4

        `NaN` values will be ignored.

        >>> num_unique(['red', 'blue', 'green', 'yellow', None])
        4
    """

    name = "num_unique"
    input_types = [ColumnSchema(semantic_tags={"category"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    stack_on_self = False
    description_template = "the number of unique elements in {}"

    def __init__(self, use_string_for_pd_calc=True):
        self.use_string_for_pd_calc = use_string_for_pd_calc

    def get_function(self):
        if self.use_string_for_pd_calc:
            return "nunique"
        return pd.Series.nunique


================================================
FILE: featuretools/primitives/standard/aggregation/num_zero_crossings.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Integer

from featuretools.primitives.base import AggregationPrimitive


class NumZeroCrossings(AggregationPrimitive):
    """Determines the number of times a list crosses 0.

    Description:
        Given a list of numbers, return the number of times the value
        crosses 0. It is the number of times the value goes from a
        positive number to a negative number, or a negative number to
        a positive number. NaN values are ignored.

    Examples:
        >>> num_zero_crossings = NumZeroCrossings()
        >>> num_zero_crossings([1, -1, 2, -2, 3, -3])
        5
    """

    name = "num_zero_crossings"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Integer, semantic_tags={"numeric"})

    def get_function(self):
        def num_zero_crossings(x):
            cleaned = x[(x != 0) & (x == x)]
            signs = np.sign(cleaned)
            difference = np.diff(signs)
            crossings = np.where(difference)[0]
            return len(crossings)

        return num_zero_crossings


================================================
FILE: featuretools/primitives/standard/aggregation/percent_true.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable, Double

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class PercentTrue(AggregationPrimitive):
    """Determines the percent of `True` values.

    Description:
        Given a list of booleans, return the percent
        of values which are `True` as a decimal.
        `NaN` values are treated as `False`,
        adding to the denominator.

    Examples:
        >>> percent_true = PercentTrue()
        >>> percent_true([True, False, True, True, None])
        0.6
    """

    name = "percent_true"
    input_types = [
        [ColumnSchema(logical_type=BooleanNullable)],
        [ColumnSchema(logical_type=Boolean)],
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    stack_on = []
    stack_on_exclude = []
    default_value = pd.NA
    description_template = "the percentage of true values in {}"

    def get_function(self):
        def percent_true(s):
            return s.fillna(False).mean()

        return percent_true


================================================
FILE: featuretools/primitives/standard/aggregation/percent_unique.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import AggregationPrimitive


class PercentUnique(AggregationPrimitive):
    """Determines the percent of unique values.

    Description:
        Given a list of values, determine what percent of the
        list is made up of unique values. When `skipna` is False, multiple
        `NaN` values are treated as one unique value.

    Args:
        skipna (bool): Determines whether to ignore `NaN` values.
            Defaults to True.

    Examples:
        >>> percent_unique = PercentUnique()
        >>> percent_unique([1, 1, 2, 2, 3, 4, 5, 6, 7, 8])
        0.8

        We can control whether or not `NaN` values are ignored.

        >>> percent_unique = PercentUnique()
        >>> percent_unique([1, 1, 2, None])
        0.5
        >>> percent_unique_skipna = PercentUnique(skipna=False)
        >>> percent_unique_skipna([1, 1, 2, None])
        0.75
    """

    name = "percent_unique"
    input_types = [ColumnSchema(semantic_tags={"category"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    default_value = 0

    def __init__(self, skipna=True):
        self.skipna = skipna

    def get_function(self):
        def percent_unique(x):
            return x.nunique(dropna=self.skipna) / (x.shape[0] * 1.0)

        return percent_unique


================================================
FILE: featuretools/primitives/standard/aggregation/skew.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Skew(AggregationPrimitive):
    """Computes the extent to which a distribution differs from a normal distribution.

    Description:
        For normally distributed data, the skewness should be about 0.
        A skewness value > 0 means that there is more weight in the
        right tail of the distribution.

    Examples:
        >>> skew = Skew()
        >>> skew([1, 10, 30, None])
        1.0437603722639681
    """

    name = "skew"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    stack_on = []
    stack_on_self = False
    description_template = "the skewness of {}"

    def get_function(self):
        return pd.Series.skew


================================================
FILE: featuretools/primitives/standard/aggregation/std.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive


class Std(AggregationPrimitive):
    """Computes the dispersion relative to the mean value, ignoring `NaN`.

    Examples:
        >>> std = Std()
        >>> round(std([1, 2, 3, 4, 5, None]), 3)
        1.414
    """

    name = "std"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    stack_on_self = False
    description_template = "the standard deviation of {}"

    def get_function(self):
        return np.std


================================================
FILE: featuretools/primitives/standard/aggregation/sum_primitive.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive
from featuretools.primitives.standard.aggregation.count import Count


class Sum(AggregationPrimitive):
    """Calculates the total addition, ignoring `NaN`.

    Examples:
        >>> sum = Sum()
        >>> sum([1, 2, 3, 4, 5, None])
        15.0
    """

    name = "sum"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    stack_on_self = False
    stack_on_exclude = [Count]
    default_value = 0
    description_template = "the sum of {}"

    def get_function(self):
        return np.sum


================================================
FILE: featuretools/primitives/standard/aggregation/time_since_first.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive
from featuretools.utils import convert_time_units


class TimeSinceFirst(AggregationPrimitive):
    """Calculates the time elapsed since the first datetime (in seconds).

    Description:
        Given a list of datetimes, calculate the
        time elapsed since the first datetime (default in
        seconds). Uses the instance's cutoff time.

    Args:
        unit (str): The unit of time in which the elapsed time is reported.
            Defaults to seconds. Acceptable values:
            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds

    Examples:
        >>> from datetime import datetime
        >>> time_since_first = TimeSinceFirst()
        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30)]
        >>> time_since_first(times, time=cutoff_time)
        900.0

        >>> from datetime import datetime
        >>> time_since_first = TimeSinceFirst(unit="minutes")
        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30)]
        >>> time_since_first(times, time=cutoff_time)
        15.0

    """

    name = "time_since_first"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_calc_time = True
    description_template = "the time since the first {}"

    def __init__(self, unit="seconds"):
        self.unit = unit.lower()

    def get_function(self):
        def time_since_first(values, time=None):
            time_since = time - values.iloc[0]
            return convert_time_units(time_since.total_seconds(), self.unit)

        return time_since_first


================================================
FILE: featuretools/primitives/standard/aggregation/time_since_last.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive
from featuretools.utils import convert_time_units


class TimeSinceLast(AggregationPrimitive):
    """Calculates the time elapsed since the last datetime (default in seconds).

    Description:
        Given a list of datetimes, calculate the
        time elapsed since the last datetime (default in
        seconds). Uses the instance's cutoff time.

    Args:
        unit (str): The unit of time in which the elapsed time is reported.
            Defaults to seconds. Acceptable values:
            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds

    Examples:
        >>> from datetime import datetime
        >>> time_since_last = TimeSinceLast()
        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30)]
        >>> time_since_last(times, time=cutoff_time)
        150.0

        >>> from datetime import datetime
        >>> time_since_last = TimeSinceLast(unit = "minutes")
        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30)]
        >>> time_since_last(times, time=cutoff_time)
        2.5

    """

    name = "time_since_last"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_calc_time = True
    description_template = "the time since the last {}"

    def __init__(self, unit="seconds"):
        self.unit = unit.lower()

    def get_function(self):
        def time_since_last(values, time=None):
            time_since = time - values.iloc[-1]
            return convert_time_units(time_since.total_seconds(), self.unit)

        return time_since_last
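

# Illustrative sketch only, not part of the featuretools API. It reproduces the
# docstring arithmetic above with the standard library. The helper name and the
# small unit table are assumptions made for this example; the primitive itself
# delegates unit handling to featuretools.utils.convert_time_units.
from datetime import datetime  # noqa: E402

_SKETCH_SECONDS_PER_UNIT = {"seconds": 1.0, "minutes": 60.0, "hours": 3600.0}


def _sketch_time_since_last(times, cutoff, unit="seconds"):
    """Return the time between the last entry of ``times`` and ``cutoff``."""
    elapsed_seconds = (cutoff - times[-1]).total_seconds()
    return elapsed_seconds / _SKETCH_SECONDS_PER_UNIT[unit.lower()]


# Usage, matching the docstring examples:
#   times = [datetime(2010, 1, 1, 11, 45, 0),
#            datetime(2010, 1, 1, 11, 55, 15),
#            datetime(2010, 1, 1, 11, 57, 30)]
#   _sketch_time_since_last(times, datetime(2010, 1, 1, 12, 0, 0))             # 150.0
#   _sketch_time_since_last(times, datetime(2010, 1, 1, 12, 0, 0), "minutes")  # 2.5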


================================================
FILE: featuretools/primitives/standard/aggregation/time_since_last_false.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable, Datetime, Double

from featuretools.primitives.base import AggregationPrimitive


class TimeSinceLastFalse(AggregationPrimitive):
    """Calculates the time since the last `False` value.

    Description:
        Using a series of Datetimes and a series of Booleans, find the last
        record with a `False` value. Return the seconds elapsed between that record
        and the instance's cutoff time. Return nan if no values are `False`.

    Examples:
        >>> from datetime import datetime
        >>> time_since_last_false = TimeSinceLastFalse()
        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30)]
        >>> booleans = [True, False, True]
        >>> time_since_last_false(times, booleans, time=cutoff_time)
        285.0
    """

    name = "time_since_last_false"
    input_types = [
        [
            ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
            ColumnSchema(logical_type=Boolean),
        ],
        [
            ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
            ColumnSchema(logical_type=BooleanNullable),
        ],
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_calc_time = True
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def time_since_last_false(datetime_col, bool_col, time=None):
            df = pd.DataFrame(
                {
                    "datetime": datetime_col,
                    "bool": bool_col,
                },
            ).dropna()
            if df.empty:
                return np.nan
            false_indices = df[~df["bool"]]
            if false_indices.empty:
                return np.nan
            last_false_index = false_indices.index[-1]
            time_since = time - datetime_col.loc[last_false_index]
            return time_since.total_seconds()

        return time_since_last_false


================================================
FILE: featuretools/primitives/standard/aggregation/time_since_last_max.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base import AggregationPrimitive


class TimeSinceLastMax(AggregationPrimitive):
    """Calculates the time since the maximum value occurred.

    Description:
        Given a list of numbers, and a corresponding index of
        datetimes, find the time of the maximum value, and return
        the time elapsed since it occurred. This calculation is done
        using the instance's cutoff time.

        If multiple values equal the maximum, use the first occurring
        maximum.

    Examples:
        >>> from datetime import datetime
        >>> time_since_last_max = TimeSinceLastMax()
        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30)]
        >>> time_since_last_max(times, [1, 3, 2], time=cutoff_time)
        285.0
    """

    name = "time_since_last_max"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_calc_time = True
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def time_since_last_max(datetime_col, numeric_col, time=None):
            df = pd.DataFrame(
                {
                    "datetime": datetime_col,
                    "numeric": numeric_col,
                },
            ).dropna()
            if df.empty:
                return np.nan
            max_row = df.loc[df["numeric"].idxmax()]
            max_time = max_row["datetime"]
            time_since = time - max_time
            return time_since.total_seconds()

        return time_since_last_max
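

# Illustrative sketch only, not part of the featuretools API. It shows the
# tie-breaking rule from the docstring (the first occurring maximum wins) in
# plain Python. The helper name is an assumption made for this example; the
# primitive itself relies on pandas' idxmax after dropping null rows.
def _sketch_time_since_first_max(times, values, cutoff):
    """Return the seconds between the first maximal value's time and ``cutoff``."""
    # max() returns the first maximal item when several compare equal.
    position = max(range(len(values)), key=values.__getitem__)
    return (cutoff - times[position]).total_seconds()


# Usage, with the times and cutoff from the docstring example:
#   _sketch_time_since_first_max(times, [1, 3, 2], cutoff_time)  # 285.0
#   _sketch_time_since_first_max(times, [3, 1, 3], cutoff_time)  # 900.0 (tie: first wins)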


================================================
FILE: featuretools/primitives/standard/aggregation/time_since_last_min.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base import AggregationPrimitive


class TimeSinceLastMin(AggregationPrimitive):
    """Calculates the time since the minimum value occurred.

    Description:
        Given a list of numbers, and a corresponding index of
        datetimes, find the time of the minimum value, and return
        the time elapsed since it occurred. This calculation is done
        using the instance's cutoff time.

        If multiple values equal the minimum, use the first occurring
        minimum.

    Examples:
        >>> from datetime import datetime
        >>> time_since_last_min = TimeSinceLastMin()
        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30)]
        >>> time_since_last_min(times, [1, 3, 2], time=cutoff_time)
        900.0
    """

    name = "time_since_last_min"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_calc_time = True
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def time_since_last_min(datetime_col, numeric_col, time=None):
            df = pd.DataFrame(
                {
                    "datetime": datetime_col,
                    "numeric": numeric_col,
                },
            ).dropna()
            if df.empty:
                return np.nan
            min_row = df.loc[df["numeric"].idxmin()]
            min_time = min_row["datetime"]
            time_since = time - min_time
            return time_since.total_seconds()

        return time_since_last_min


================================================
FILE: featuretools/primitives/standard/aggregation/time_since_last_true.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable, Datetime, Double

from featuretools.primitives.base import AggregationPrimitive


class TimeSinceLastTrue(AggregationPrimitive):
    """Calculates the time since the last `True` value.

    Description:
        Using a series of Datetimes and a series of Booleans, find the last
        record with a `True` value. Return the seconds elapsed between that record
        and the instance's cutoff time. Return nan if no values are `True`.

    Examples:
        >>> from datetime import datetime
        >>> time_since_last_true = TimeSinceLastTrue()
        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30)]
        >>> booleans = [True, True, False]
        >>> time_since_last_true(times, booleans, time=cutoff_time)
        285.0
    """

    name = "time_since_last_true"
    input_types = [
        [
            ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
            ColumnSchema(logical_type=Boolean),
        ],
        [
            ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
            ColumnSchema(logical_type=BooleanNullable),
        ],
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_calc_time = True
    stack_on_self = False
    default_value = 0

    def get_function(self):
        def time_since_last_true(datetime_col, bool_col, time=None):
            df = pd.DataFrame(
                {
                    "datetime": datetime_col,
                    "bool": bool_col,
                },
            ).dropna()
            if df.empty:
                return np.nan
            true_indices = df[df["bool"]]
            if true_indices.empty:
                return np.nan
            last_true_index = true_indices.index[-1]
            time_since = time - datetime_col.loc[last_true_index]
            return time_since.total_seconds()

        return time_since_last_true


================================================
FILE: featuretools/primitives/standard/aggregation/trend.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive
from featuretools.utils import calculate_trend


class Trend(AggregationPrimitive):
    """Calculates the trend of a column over time.

    Description:
        Given a list of values and a corresponding list of
        datetimes, calculate the slope of the linear trend
        of values.

    Examples:
        >>> from datetime import datetime
        >>> trend = Trend()
        >>> times = [datetime(2010, 1, 1, 11, 45, 0),
        ...          datetime(2010, 1, 1, 11, 55, 15),
        ...          datetime(2010, 1, 1, 11, 57, 30),
        ...          datetime(2010, 1, 1, 11, 12),
        ...          datetime(2010, 1, 1, 11, 12, 15)]
        >>> round(trend([1, 2, 3, 4, 5], times), 3)
        -0.053
    """

    name = "trend"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the linear trend of {} over time"

    def get_function(self):
        def pd_trend(y, x):
            return calculate_trend(pd.Series(data=y.values, index=x.values))

        return pd_trend


================================================
FILE: featuretools/primitives/standard/aggregation/variance.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import AggregationPrimitive


class Variance(AggregationPrimitive):
    """Calculates the variance of a list of numbers.

    Description:
        Given a list of numbers, return the variance,
        using numpy's built-in variance function. Nan
        values in a series will be ignored. Return nan
        when the series is empty or entirely null.

    Examples:
        >>> variance = Variance()
        >>> variance([0, 3, 4, 3])
        2.25

        Null values in a series will be ignored.

        >>> variance = Variance()
        >>> variance([0, 3, 4, 3, None])
        2.25
    """

    name = "variance"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    stack_on_self = False
    default_value = np.nan

    def get_function(self):
        return np.var


================================================
FILE: featuretools/primitives/standard/transform/__init__.py
================================================
# flake8: noqa
from featuretools.primitives.standard.transform.absolute_diff import AbsoluteDiff
from featuretools.primitives.standard.transform.binary import *
from featuretools.primitives.standard.transform.cumulative import *
from featuretools.primitives.standard.transform.datetime import *
from featuretools.primitives.standard.transform.email import *
from featuretools.primitives.standard.transform.exponential import *
from featuretools.primitives.standard.transform.file_extension import FileExtension
from featuretools.primitives.standard.transform.full_name_to_first_name import (
    FullNameToFirstName,
)
from featuretools.primitives.standard.transform.full_name_to_last_name import (
    FullNameToLastName,
)
from featuretools.primitives.standard.transform.full_name_to_title import (
    FullNameToTitle,
)
from featuretools.primitives.standard.transform.nth_week_of_month import NthWeekOfMonth
from featuretools.primitives.standard.transform.is_in import IsIn
from featuretools.primitives.standard.transform.is_null import IsNull
from featuretools.primitives.standard.transform.latlong import *
from featuretools.primitives.standard.transform.natural_language import *
from featuretools.primitives.standard.transform.not_primitive import Not
from featuretools.primitives.standard.transform.numeric import *
from featuretools.primitives.standard.transform.percent_change import PercentChange
from featuretools.primitives.standard.transform.postal import *
from featuretools.primitives.standard.transform.savgol_filter import SavgolFilter
from featuretools.primitives.standard.transform.time_series import *
from featuretools.primitives.standard.transform.url import *


================================================
FILE: featuretools/primitives/standard/transform/absolute_diff.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class AbsoluteDiff(TransformPrimitive):
    """Calculates the absolute difference from the previous element
       in a list of numbers.

    Description:
        The absolute difference from the previous element is computed for
        all elements in the input. The first item in the output will always
        be nan, since there is no previous element for the first element.
        Elements in the input containing nan will be filled using either a
        forward-fill or backward-fill method, specified by the method argument.

    Args:
        method (str): Method to use for filling nan values in reindexed
            Series. Possible values are ['pad', 'ffill', 'backfill', 'bfill'].
            Default is 'ffill'.

            `pad / ffill`: propagate last valid observation forward
                to fill gap

            `backfill / bfill`: propagate next valid observation backward
                to fill gap

        limit (int): The max number of consecutive NaN values in a gap that
            can be filled. Default is None.

    Examples:
        >>> absolute_diff = AbsoluteDiff()
        >>> absolute_diff([2, 5, 15, 3]).tolist()
        [nan, 3.0, 10.0, 12.0]

        Forward filling of input elements using the 'ffill' argument

        >>> absolute_diff_ffill = AbsoluteDiff(method="ffill")
        >>> absolute_diff_ffill([None, 5, 10, 20, None, 10, None]).tolist()
        [nan, nan, 5.0, 10.0, 0.0, 10.0, 0.0]

        Backward filling of input elements using the 'bfill' argument

        >>> absolute_diff_bfill = AbsoluteDiff(method="bfill")
        >>> absolute_diff_bfill([None, 5, 10, 20, None, 10, None]).tolist()
        [nan, 0.0, 5.0, 10.0, 10.0, 0.0, nan]

        The number of nan values that are filled can be limited

        >>> absolute_diff_limitfill = AbsoluteDiff(limit=2)
        >>> absolute_diff_limitfill([2, None, None, None, 3, 1]).tolist()
        [nan, 0.0, 0.0, nan, nan, 2.0]

    """

    name = "absolute_diff"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, method="ffill", limit=None):
        if method not in ["backfill", "bfill", "pad", "ffill"]:
            raise ValueError("Invalid method")
        self.method = method
        self.limit = limit

    def get_function(self):
        def absolute_diff(data):
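            # Series.fillna(method=...) is deprecated as of pandas 2.1, so call
            # the dedicated fill methods instead; the result is unchanged.
            if self.method in ("backfill", "bfill"):
                filled = data.bfill(limit=self.limit)
            else:
                filled = data.ffill(limit=self.limit)
            return filled.diff().abs()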

        return absolute_diff
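

# Illustrative sketch only, not part of the featuretools API. It walks through
# the forward-fill case described in the docstring (fill gaps up to ``limit``
# consecutive missing values, then take absolute differences) in plain Python.
# The helper name is an assumption made for this example; the primitive itself
# delegates the work to pandas.
import math  # noqa: E402


def _sketch_ffill_abs_diff(values, limit=None):
    """Forward-fill missing entries, then return absolute successive differences."""
    if not values:
        return []
    filled = []
    last_valid = None
    gap = 0  # number of consecutive missing values seen so far
    for value in values:
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        if not missing:
            last_valid, gap = value, 0
            filled.append(float(value))
            continue
        gap += 1
        if last_valid is not None and (limit is None or gap <= limit):
            filled.append(float(last_valid))
        else:
            filled.append(math.nan)
    # The first element has no predecessor, so its difference is nan.
    diffs = [math.nan]
    for previous, current in zip(filled, filled[1:]):
        diffs.append(abs(current - previous))
    return diffs


# Usage, matching the docstring examples:
#   _sketch_ffill_abs_diff([2, 5, 15, 3])                         # [nan, 3.0, 10.0, 12.0]
#   _sketch_ffill_abs_diff([2, None, None, None, 3, 1], limit=2)  # [nan, 0.0, 0.0, nan, nan, 2.0]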


================================================
FILE: featuretools/primitives/standard/transform/binary/__init__.py
================================================
from featuretools.primitives.standard.transform.binary.add_numeric import AddNumeric
from featuretools.primitives.standard.transform.binary.add_numeric_scalar import (
    AddNumericScalar,
)
from featuretools.primitives.standard.transform.binary.and_primitive import And
from featuretools.primitives.standard.transform.binary.divide_by_feature import (
    DivideByFeature,
)
from featuretools.primitives.standard.transform.binary.divide_numeric import (
    DivideNumeric,
)
from featuretools.primitives.standard.transform.binary.divide_numeric_scalar import (
    DivideNumericScalar,
)
from featuretools.primitives.standard.transform.binary.equal import Equal
from featuretools.primitives.standard.transform.binary.equal_scalar import EqualScalar
from featuretools.primitives.standard.transform.binary.greater_than import GreaterThan
from featuretools.primitives.standard.transform.binary.greater_than_equal_to import (
    GreaterThanEqualTo,
)
from featuretools.primitives.standard.transform.binary.greater_than_equal_to_scalar import (
    GreaterThanEqualToScalar,
)
from featuretools.primitives.standard.transform.binary.greater_than_scalar import (
    GreaterThanScalar,
)
from featuretools.primitives.standard.transform.binary.less_than import LessThan
from featuretools.primitives.standard.transform.binary.less_than_equal_to import (
    LessThanEqualTo,
)
from featuretools.primitives.standard.transform.binary.less_than_equal_to_scalar import (
    LessThanEqualToScalar,
)
from featuretools.primitives.standard.transform.binary.less_than_scalar import (
    LessThanScalar,
)
from featuretools.primitives.standard.transform.binary.modulo_by_feature import (
    ModuloByFeature,
)
from featuretools.primitives.standard.transform.binary.modulo_numeric import (
    ModuloNumeric,
)
from featuretools.primitives.standard.transform.binary.modulo_numeric_scalar import (
    ModuloNumericScalar,
)
from featuretools.primitives.standard.transform.binary.multiply_boolean import (
    MultiplyBoolean,
)
from featuretools.primitives.standard.transform.binary.multiply_numeric import (
    MultiplyNumeric,
)
from featuretools.primitives.standard.transform.binary.multiply_numeric_boolean import (
    MultiplyNumericBoolean,
)
from featuretools.primitives.standard.transform.binary.multiply_numeric_scalar import (
    MultiplyNumericScalar,
)
from featuretools.primitives.standard.transform.binary.not_equal import NotEqual
from featuretools.primitives.standard.transform.binary.not_equal_scalar import (
    NotEqualScalar,
)
from featuretools.primitives.standard.transform.binary.or_primitive import Or
from featuretools.primitives.standard.transform.binary.scalar_subtract_numeric_feature import (
    ScalarSubtractNumericFeature,
)
from featuretools.primitives.standard.transform.binary.subtract_numeric import (
    SubtractNumeric,
)
from featuretools.primitives.standard.transform.binary.subtract_numeric_scalar import (
    SubtractNumericScalar,
)


================================================
FILE: featuretools/primitives/standard/transform/binary/add_numeric.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class AddNumeric(TransformPrimitive):
    """Performs element-wise addition of two lists.

    Description:
        Given a list of values X and a list of values
        Y, determine the sum of each value in X with its
        corresponding value in Y.

    Examples:
        >>> add_numeric = AddNumeric()
        >>> add_numeric([2, 1, 2], [1, 2, 2]).tolist()
        [3, 3, 4]
    """

    name = "add_numeric"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    commutative = True

    description_template = "the sum of {} and {}"

    def get_function(self):
        return np.add

    def generate_name(self, base_feature_names):
        return "%s + %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/add_numeric_scalar.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class AddNumericScalar(TransformPrimitive):
    """Adds a scalar to each value in the list.

    Description:
        Given a list of numeric values and a scalar, add
        the given scalar to each value in the list.

    Examples:
        >>> add_numeric_scalar = AddNumericScalar(value=2)
        >>> add_numeric_scalar([3, 1, 2]).tolist()
        [5, 3, 4]
    """

    name = "add_numeric_scalar"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, value=0):
        self.value = value
        self.description_template = "the sum of {{}} and {}".format(self.value)

    def get_function(self):
        def add_scalar(vals):
            return vals + self.value

        return add_scalar

    def generate_name(self, base_feature_names):
        return "%s + %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/and_primitive.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class And(TransformPrimitive):
    """Performs element-wise logical AND of two lists.

    Description:
        Given a list of booleans X and a list of booleans Y,
        determine whether each value in X is `True`, and
        whether its corresponding value in Y is also `True`.

    Examples:
        >>> _and = And()
        >>> _and([False, True, False], [True, True, False]).tolist()
        [False, True, False]
    """

    name = "and"
    input_types = [
        [
            ColumnSchema(logical_type=BooleanNullable),
            ColumnSchema(logical_type=BooleanNullable),
        ],
        [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],
        [
            ColumnSchema(logical_type=Boolean),
            ColumnSchema(logical_type=BooleanNullable),
        ],
        [
            ColumnSchema(logical_type=BooleanNullable),
            ColumnSchema(logical_type=Boolean),
        ],
    ]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    commutative = True

    description_template = "whether {} and {} are true"

    def get_function(self):
        return np.logical_and

    def generate_name(self, base_feature_names):
        return "AND(%s, %s)" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/divide_by_feature.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class DivideByFeature(TransformPrimitive):
    """Divides a scalar by each value in the list.

    Description:
        Given a list of numeric values and a scalar, divide
        the scalar by each value and return the list of
        quotients.

    Examples:
        >>> divide_by_feature = DivideByFeature(value=2)
        >>> divide_by_feature([4, 1, 2]).tolist()
        [0.5, 2.0, 1.0]
    """

    name = "divide_by_feature"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, value=1):
        self.value = value
        self.description_template = "the result of {} divided by {{}}".format(
            self.value,
        )

    def get_function(self):
        def divide_by_feature(vals):
            return self.value / vals

        return divide_by_feature

    def generate_name(self, base_feature_names):
        return "%s / %s" % (str(self.value), base_feature_names[0])


================================================
FILE: featuretools/primitives/standard/transform/binary/divide_numeric.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class DivideNumeric(TransformPrimitive):
    """Performs element-wise division of two lists.

    Description:
        Given a list of values X and a list of values
        Y, determine the quotient of each value in X
        divided by its corresponding value in Y.

    Args:
        commutative (bool): determines if Deep Feature Synthesis should
            generate both x / y and y / x, or just one. If True, there is
            no guarantee which of the two will be generated. Defaults to False.

    Examples:
        >>> divide_numeric = DivideNumeric()
        >>> divide_numeric([2.0, 1.0, 2.0], [1.0, 2.0, 2.0]).tolist()
        [2.0, 0.5, 1.0]
    """

    name = "divide_numeric"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    description_template = "the result of {} divided by {}"

    def __init__(self, commutative=False):
        self.commutative = commutative

    def get_function(self):
        def divide_numeric(val1, val2):
            return val1 / val2

        return divide_numeric

    def generate_name(self, base_feature_names):
        return "%s / %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/divide_numeric_scalar.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class DivideNumericScalar(TransformPrimitive):
    """Divides each element in the list by a scalar.

    Description:
        Given a list of numeric values and a scalar, divide
        each value in the list by the scalar.

    Examples:
        >>> divide_numeric_scalar = DivideNumericScalar(value=2)
        >>> divide_numeric_scalar([3, 1, 2]).tolist()
        [1.5, 0.5, 1.0]
    """

    name = "divide_numeric_scalar"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, value=1):
        self.value = value
        self.description_template = "the result of {{}} divided by {}".format(
            self.value,
        )

    def get_function(self):
        def divide_scalar(vals):
            return vals / self.value

        return divide_scalar

    def generate_name(self, base_feature_names):
        return "%s / %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/equal.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class Equal(TransformPrimitive):
    """Determines if values in one list are equal to another list.

    Description:
        Given a list of values X and a list of values Y, determine
        whether each value in X is equal to each corresponding value
        in Y.

    Examples:
        >>> equal = Equal()
        >>> equal([2, 1, 2], [1, 2, 2]).tolist()
        [False, False, True]
    """

    name = "equal"
    input_types = [ColumnSchema(), ColumnSchema()]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    commutative = True

    description_template = "whether {} equals {}"

    def get_function(self):
        def equal(x_vals, y_vals):
            if isinstance(x_vals.dtype, pd.CategoricalDtype) and isinstance(
                y_vals.dtype,
                pd.CategoricalDtype,
            ):
                categories = set(x_vals.cat.categories).union(
                    set(y_vals.cat.categories),
                )
                x_vals = x_vals.cat.add_categories(
                    categories.difference(set(x_vals.cat.categories)),
                )
                y_vals = y_vals.cat.add_categories(
                    categories.difference(set(y_vals.cat.categories)),
                )
            return x_vals.eq(y_vals)

        return equal

    def generate_name(self, base_feature_names):
        return "%s = %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/equal_scalar.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class EqualScalar(TransformPrimitive):
    """Determines if values in a list are equal to a given scalar.

    Description:
        Given a list of values and a constant scalar, determine
        whether each of the values is equal to the scalar.

    Examples:
        >>> equal_scalar = EqualScalar(value=2)
        >>> equal_scalar([3, 1, 2]).tolist()
        [False, False, True]
    """

    name = "equal_scalar"
    input_types = [ColumnSchema()]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def __init__(self, value=None):
        self.value = value
        self.description_template = "whether {{}} equals {}".format(self.value)

    def get_function(self):
        def equal_scalar(vals):
            return vals == self.value

        return equal_scalar

    def generate_name(self, base_feature_names):
        return "%s = %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/greater_than.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime, Ordinal

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class GreaterThan(TransformPrimitive):
    """Determines if values in one list are greater than another list.

    Description:
        Given a list of values X and a list of values Y, determine
        whether each value in X is greater than each corresponding
        value in Y. Equal pairs will return `False`.

    Examples:
        >>> greater_than = GreaterThan()
        >>> greater_than([2, 1, 2], [1, 2, 2]).tolist()
        [True, False, False]
    """

    name = "greater_than"
    input_types = [
        [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(semantic_tags={"numeric"}),
        ],
        [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)],
        [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)],
    ]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    description_template = "whether {} is greater than {}"

    def get_function(self):
        def greater_than(val1, val2):
            val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype)
            val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype)
            if val1_is_categorical and val2_is_categorical:
                # Ordinal columns whose categories differ are not comparable,
                # so every result is set to NaN.
                if not all(val1.cat.categories == val2.cat.categories):
                    return val1.where(pd.isnull, np.nan)
            elif val1_is_categorical or val2_is_categorical:
                # This can happen because CFM does not set proper dtypes for intermediate
                # features, so some agg features that should be Ordinal don't yet have correct type.
                return val1.where(pd.isnull, np.nan)
            return val1 > val2

        return greater_than

    def generate_name(self, base_feature_names):
        return "%s > %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/greater_than_equal_to.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime, Ordinal

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class GreaterThanEqualTo(TransformPrimitive):
    """Determines if values in one list are greater than or equal to another list.

    Description:
        Given a list of values X and a list of values Y, determine
        whether each value in X is greater than or equal to each
        corresponding value in Y. Equal pairs will return `True`.

    Examples:
        >>> greater_than_equal_to = GreaterThanEqualTo()
        >>> greater_than_equal_to([2, 1, 2], [1, 2, 2]).tolist()
        [True, False, True]
    """

    name = "greater_than_equal_to"
    input_types = [
        [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(semantic_tags={"numeric"}),
        ],
        [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)],
        [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)],
    ]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is greater than or equal to {}"

    def get_function(self):
        def greater_than_equal(val1, val2):
            val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype)
            val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype)
            if val1_is_categorical and val2_is_categorical:
                # Ordinal columns whose categories differ are not comparable,
                # so every result is set to NaN.
                if not all(val1.cat.categories == val2.cat.categories):
                    return val1.where(pd.isnull, np.nan)
            elif val1_is_categorical or val2_is_categorical:
                # This can happen because CFM does not set proper dtypes for intermediate
                # features, so some agg features that should be Ordinal don't yet have correct type.
                return val1.where(pd.isnull, np.nan)
            return val1 >= val2

        return greater_than_equal

    def generate_name(self, base_feature_names):
        return "%s >= %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/greater_than_equal_to_scalar.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class GreaterThanEqualToScalar(TransformPrimitive):
    """Determines if values are greater than or equal to a given scalar.

    Description:
        Given a list of values and a constant scalar, determine
        whether each of the values is greater than or equal to the
        scalar. If a value is equal to the scalar, return `True`.

    Examples:
        >>> greater_than_equal_to_scalar = GreaterThanEqualToScalar(value=2)
        >>> greater_than_equal_to_scalar([3, 1, 2]).tolist()
        [True, False, True]
    """

    name = "greater_than_equal_to_scalar"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def __init__(self, value=0):
        self.value = value
        self.description_template = (
            "whether {{}} is greater than or equal to {}".format(self.value)
        )

    def get_function(self):
        def greater_than_equal_to_scalar(vals):
            return vals >= self.value

        return greater_than_equal_to_scalar

    def generate_name(self, base_feature_names):
        return "%s >= %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/greater_than_scalar.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class GreaterThanScalar(TransformPrimitive):
    """Determines if values are greater than a given scalar.

    Description:
        Given a list of values and a constant scalar, determine
        whether each of the values is greater than the scalar.
        If a value is equal to the scalar, return `False`.

    Examples:
        >>> greater_than_scalar = GreaterThanScalar(value=2)
        >>> greater_than_scalar([3, 1, 2]).tolist()
        [True, False, False]
    """

    name = "greater_than_scalar"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def __init__(self, value=0):
        self.value = value
        self.description_template = "whether {{}} is greater than {}".format(self.value)

    def get_function(self):
        def greater_than_scalar(vals):
            return vals > self.value

        return greater_than_scalar

    def generate_name(self, base_feature_names):
        return "%s > %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/less_than.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime, Ordinal

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class LessThan(TransformPrimitive):
    """Determines if values in one list are less than another list.

    Description:
        Given a list of values X and a list of values Y, determine
        whether each value in X is less than each corresponding value
        in Y. Equal pairs will return `False`.

    Examples:
        >>> less_than = LessThan()
        >>> less_than([2, 1, 2], [1, 2, 2]).tolist()
        [False, True, False]
    """

    name = "less_than"
    input_types = [
        [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(semantic_tags={"numeric"}),
        ],
        [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)],
        [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)],
    ]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is less than {}"

    def get_function(self):
        def less_than(val1, val2):
            val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype)
            val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype)
            if val1_is_categorical and val2_is_categorical:
                # Ordinal columns whose categories differ are not comparable,
                # so every result is set to NaN.
                if not all(val1.cat.categories == val2.cat.categories):
                    return val1.where(pd.isnull, np.nan)
            elif val1_is_categorical or val2_is_categorical:
                # This can happen because CFM does not set proper dtypes for intermediate
                # features, so some agg features that should be Ordinal don't yet have correct type.
                return val1.where(pd.isnull, np.nan)
            return val1 < val2

        return less_than

    def generate_name(self, base_feature_names):
        return "%s < %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/less_than_equal_to.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime, Ordinal

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class LessThanEqualTo(TransformPrimitive):
    """Determines if values in one list are less than or equal to another list.

    Description:
        Given a list of values X and a list of values Y, determine
        whether each value in X is less than or equal to each
        corresponding value in Y. Equal pairs will return `True`.

    Examples:
        >>> less_than_equal_to = LessThanEqualTo()
        >>> less_than_equal_to([2, 1, 2], [1, 2, 2]).tolist()
        [False, True, True]
    """

    name = "less_than_equal_to"
    input_types = [
        [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(semantic_tags={"numeric"}),
        ],
        [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)],
        [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)],
    ]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is less than or equal to {}"

    def get_function(self):
        def less_than_equal(val1, val2):
            val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype)
            val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype)
            if val1_is_categorical and val2_is_categorical:
                # Ordinal columns whose categories differ are not comparable,
                # so every result is set to NaN.
                if not all(val1.cat.categories == val2.cat.categories):
                    return val1.where(pd.isnull, np.nan)
            elif val1_is_categorical or val2_is_categorical:
                # This can happen because CFM does not set proper dtypes for intermediate
                # features, so some agg features that should be Ordinal don't yet have correct type.
                return val1.where(pd.isnull, np.nan)
            return val1 <= val2

        return less_than_equal

    def generate_name(self, base_feature_names):
        return "%s <= %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/less_than_equal_to_scalar.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class LessThanEqualToScalar(TransformPrimitive):
    """Determines if values are less than or equal to a given scalar.

    Description:
        Given a list of values and a constant scalar, determine
        whether each of the values is less than or equal to the
        scalar. If a value is equal to the scalar, return `True`.

    Examples:
        >>> less_than_equal_to_scalar = LessThanEqualToScalar(value=2)
        >>> less_than_equal_to_scalar([3, 1, 2]).tolist()
        [False, True, True]
    """

    name = "less_than_equal_to_scalar"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def __init__(self, value=0):
        self.value = value
        self.description_template = "whether {{}} is less than or equal to {}".format(
            self.value,
        )

    def get_function(self):
        def less_than_equal_to_scalar(vals):
            return vals <= self.value

        return less_than_equal_to_scalar

    def generate_name(self, base_feature_names):
        return "%s <= %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/less_than_scalar.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class LessThanScalar(TransformPrimitive):
    """Determines if values are less than a given scalar.

    Description:
        Given a list of values and a constant scalar, determine
        whether each of the values is less than the scalar.
        If a value is equal to the scalar, return `False`.

    Examples:
        >>> less_than_scalar = LessThanScalar(value=2)
        >>> less_than_scalar([3, 1, 2]).tolist()
        [False, True, False]
    """

    name = "less_than_scalar"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def __init__(self, value=0):
        self.value = value
        self.description_template = "whether {{}} is less than {}".format(self.value)

    def get_function(self):
        def less_than_scalar(vals):
            return vals < self.value

        return less_than_scalar

    def generate_name(self, base_feature_names):
        return "%s < %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/modulo_by_feature.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class ModuloByFeature(TransformPrimitive):
    """Computes the modulo of a scalar by each element in a list.

    Description:
        Given a list of numeric values and a scalar, return the
        modulo, or remainder of the scalar after being divided
        by each value.

    Examples:
        >>> modulo_by_feature = ModuloByFeature(value=2)
        >>> modulo_by_feature([4, 1, 2]).tolist()
        [2, 0, 0]
    """

    name = "modulo_by_feature"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, value=1):
        self.value = value
        self.description_template = "the remainder after dividing {} by {{}}".format(
            self.value,
        )

    def get_function(self):
        def modulo_by_feature(vals):
            return self.value % vals

        return modulo_by_feature

    def generate_name(self, base_feature_names):
        return "%s %% %s" % (str(self.value), base_feature_names[0])
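

# Illustrative sketch only, not part of the featuretools API: the helper name
# `_sketch_modulo_operand_order` is hypothetical and works on plain Python
# lists rather than pandas Series. It contrasts the operand order used here
# (scalar % value) with the reverse order (value % scalar) used by
# ModuloNumericScalar, since the two are easy to confuse.
def _sketch_modulo_operand_order(value, vals):
    # ModuloByFeature order: the scalar is the dividend.
    scalar_by_feature = [value % v for v in vals]
    # ModuloNumericScalar order: the scalar is the divisor.
    feature_by_scalar = [v % value for v in vals]
    return scalar_by_feature, feature_by_scalar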


================================================
FILE: featuretools/primitives/standard/transform/binary/modulo_numeric.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class ModuloNumeric(TransformPrimitive):
    """Performs element-wise modulo of two lists.

    Description:
        Given a list of values X and a list of values Y,
        determine the modulo, or remainder of each value in
        X after it's divided by its corresponding value in Y.

    Examples:
        >>> modulo_numeric = ModuloNumeric()
        >>> modulo_numeric([2, 1, 5], [1, 2, 2]).tolist()
        [0, 1, 1]
    """

    name = "modulo_numeric"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    description_template = "the remainder after dividing {} by {}"

    def get_function(self):
        return np.mod

    def generate_name(self, base_feature_names):
        return "%s %% %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/modulo_numeric_scalar.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class ModuloNumericScalar(TransformPrimitive):
    """Computes the modulo of each element in the list by a given scalar.

    Description:
        Given a list of numeric values and a scalar, return
        the modulo, or remainder of each value after being
        divided by the scalar.

    Examples:
        >>> modulo_numeric_scalar = ModuloNumericScalar(value=2)
        >>> modulo_numeric_scalar([3, 1, 2]).tolist()
        [1, 1, 0]
    """

    name = "modulo_numeric_scalar"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, value=1):
        self.value = value
        self.description_template = "the remainder after dividing {{}} by {}".format(
            self.value,
        )

    def get_function(self):
        def modulo_scalar(vals):
            return vals % self.value

        return modulo_scalar

    def generate_name(self, base_feature_names):
        return "%s %% %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/multiply_boolean.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class MultiplyBoolean(TransformPrimitive):
    """Performs element-wise multiplication of two lists of boolean values.

    Description:
        Given a list of boolean values X and a list of boolean
        values Y, determine the product of each value in X
        with its corresponding value in Y.

    Examples:
        >>> multiply_boolean = MultiplyBoolean()
        >>> multiply_boolean([True, True, False], [True, False, True]).tolist()
        [True, False, False]
    """

    name = "multiply_boolean"
    input_types = [
        [
            ColumnSchema(logical_type=BooleanNullable),
            ColumnSchema(logical_type=BooleanNullable),
        ],
        [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],
        [
            ColumnSchema(logical_type=Boolean),
            ColumnSchema(logical_type=BooleanNullable),
        ],
        [
            ColumnSchema(logical_type=BooleanNullable),
            ColumnSchema(logical_type=Boolean),
        ],
    ]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    commutative = True
    description_template = "the product of {} and {}"

    def get_function(self):
        return np.bitwise_and

    def generate_name(self, base_feature_names):
        return "%s * %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/multiply_numeric.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class MultiplyNumeric(TransformPrimitive):
    """Performs element-wise multiplication of two lists.

    Description:
        Given a list of values X and a list of values
        Y, determine the product of each value in X
        with its corresponding value in Y.

    Examples:
        >>> multiply_numeric = MultiplyNumeric()
        >>> multiply_numeric([2, 1, 2], [1, 2, 2]).tolist()
        [2, 2, 4]
    """

    name = "multiply_numeric"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    commutative = True

    description_template = "the product of {} and {}"

    def get_function(self):
        return np.multiply

    def generate_name(self, base_feature_names):
        return "%s * %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/multiply_numeric_boolean.py
================================================
import pandas.api.types as pdtypes
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class MultiplyNumericBoolean(TransformPrimitive):
    """Performs element-wise multiplication of a numeric list with a boolean list.

    Description:
        Given a list of numeric values X and a list of
        boolean values Y, return the value in X where the
        corresponding value in Y is True, and 0 where it is False.

    Examples:
        >>> import pandas as pd
        >>> multiply_numeric_boolean = MultiplyNumericBoolean()
        >>> multiply_numeric_boolean([2, 1, 2], [True, True, False]).tolist()
        [2, 1, 0]
        >>> multiply_numeric_boolean([2, None, None], [True, True, False]).astype("float64").tolist()
        [2.0, nan, nan]
        >>> multiply_numeric_boolean([2, 1, 2], pd.Series([True, True, pd.NA], dtype="boolean")).tolist()
        [2, 1, <NA>]
    """

    name = "multiply_numeric_boolean"
    input_types = [
        [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(logical_type=Boolean),
        ],
        [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(logical_type=BooleanNullable),
        ],
        [
            ColumnSchema(logical_type=Boolean),
            ColumnSchema(semantic_tags={"numeric"}),
        ],
        [
            ColumnSchema(logical_type=BooleanNullable),
            ColumnSchema(semantic_tags={"numeric"}),
        ],
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    commutative = True
    description_template = "the product of {} and {}"

    def get_function(self):
        def multiply_numeric_boolean(ser1, ser2):
            if pdtypes.is_bool_dtype(ser1):
                bools = ser1
                vals = ser2
            else:
                bools = ser2
                vals = ser1
            result = vals * bools.astype("Int64")
            return result

        return multiply_numeric_boolean

    def generate_name(self, base_feature_names):
        return "%s * %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/multiply_numeric_scalar.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class MultiplyNumericScalar(TransformPrimitive):
    """Multiplies each element in the list by a scalar.

    Description:
        Given a list of numeric values and a scalar, multiply
        each value in the list by the scalar.

    Examples:
        >>> multiply_numeric_scalar = MultiplyNumericScalar(value=2)
        >>> multiply_numeric_scalar([3, 1, 2]).tolist()
        [6, 2, 4]
    """

    name = "multiply_numeric_scalar"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, value=1):
        self.value = value
        self.description_template = "the product of {{}} and {}".format(self.value)

    def get_function(self):
        def multiply_scalar(vals):
            return vals * self.value

        return multiply_scalar

    def generate_name(self, base_feature_names):
        return "%s * %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/not_equal.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class NotEqual(TransformPrimitive):
    """Determines if values in one list are not equal to another list.

    Description:
        Given a list of values X and a list of values Y, determine
        whether each value in X is not equal to each corresponding
        value in Y.

    Examples:
        >>> not_equal = NotEqual()
        >>> not_equal([2, 1, 2], [1, 2, 2]).tolist()
        [True, True, False]
    """

    name = "not_equal"
    input_types = [ColumnSchema(), ColumnSchema()]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    commutative = True
    description_template = "whether {} does not equal {}"

    def get_function(self):
        def not_equal(x_vals, y_vals):
            if isinstance(x_vals.dtype, pd.CategoricalDtype) and isinstance(
                y_vals.dtype,
                pd.CategoricalDtype,
            ):
                categories = set(x_vals.cat.categories).union(
                    set(y_vals.cat.categories),
                )
                x_vals = x_vals.cat.add_categories(
                    categories.difference(set(x_vals.cat.categories)),
                )
                y_vals = y_vals.cat.add_categories(
                    categories.difference(set(y_vals.cat.categories)),
                )
            return x_vals.ne(y_vals)

        return not_equal

    def generate_name(self, base_feature_names):
        return "%s != %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/not_equal_scalar.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class NotEqualScalar(TransformPrimitive):
    """Determines if values in a list are not equal to a given scalar.

    Description:
        Given a list of values and a constant scalar, determine
        whether each of the values is not equal to the scalar.

    Examples:
        >>> not_equal_scalar = NotEqualScalar(value=2)
        >>> not_equal_scalar([3, 1, 2]).tolist()
        [True, True, False]
    """

    name = "not_equal_scalar"
    input_types = [ColumnSchema()]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def __init__(self, value=None):
        self.value = value
        self.description_template = "whether {{}} does not equal {}".format(self.value)

    def get_function(self):
        def not_equal_scalar(vals):
            return vals != self.value

        return not_equal_scalar

    def generate_name(self, base_feature_names):
        return "%s != %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/binary/or_primitive.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class Or(TransformPrimitive):
    """Performs element-wise logical OR of two lists.

    Description:
        Given a list of booleans X and a list of booleans Y,
        determine whether each value in X is `True`, or
        whether its corresponding value in Y is `True`.

    Examples:
        >>> _or = Or()
        >>> _or([False, True, False], [True, True, False]).tolist()
        [True, True, False]
    """

    name = "or"
    input_types = [
        [
            ColumnSchema(logical_type=BooleanNullable),
            ColumnSchema(logical_type=BooleanNullable),
        ],
        [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],
        [
            ColumnSchema(logical_type=Boolean),
            ColumnSchema(logical_type=BooleanNullable),
        ],
        [
            ColumnSchema(logical_type=BooleanNullable),
            ColumnSchema(logical_type=Boolean),
        ],
    ]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    commutative = True

    description_template = "whether {} is true or {} is true"

    def get_function(self):
        return np.logical_or

    def generate_name(self, base_feature_names):
        return "OR(%s, %s)" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/scalar_subtract_numeric_feature.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class ScalarSubtractNumericFeature(TransformPrimitive):
    """Subtracts each value in the list from a given scalar.

    Description:
        Given a list of numeric values and a scalar, subtract
        each value from the scalar and return the list of
        differences.

    Examples:
        >>> scalar_subtract_numeric_feature = ScalarSubtractNumericFeature(value=2)
        >>> scalar_subtract_numeric_feature([3, 1, 2]).tolist()
        [-1, 1, 0]
    """

    name = "scalar_subtract_numeric_feature"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, value=0):
        self.value = value
        self.description_template = "the result of {} minus {{}}".format(self.value)

    def get_function(self):
        def scalar_subtract_numeric_feature(vals):
            return self.value - vals

        return scalar_subtract_numeric_feature

    def generate_name(self, base_feature_names):
        return "%s - %s" % (str(self.value), base_feature_names[0])


================================================
FILE: featuretools/primitives/standard/transform/binary/subtract_numeric.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class SubtractNumeric(TransformPrimitive):
    """Performs element-wise subtraction of two lists.

    Description:
        Given a list of values X and a list of values
        Y, subtract each value in Y from its corresponding
        value in X.

    Args:
        commutative (bool): determines if Deep Feature Synthesis should
            generate both x - y and y - x, or just one. If True, there is no
            guarantee which of the two will be generated. Defaults to True.

    Notes:
        commutative is True by default since False would result in 2 perfectly
        correlated series.

    Examples:
        >>> subtract_numeric = SubtractNumeric()
        >>> subtract_numeric([2, 1, 2], [1, 2, 2]).tolist()
        [1, -1, 0]
    """

    name = "subtract_numeric"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the result of {} minus {}"
    commutative = True

    def __init__(self, commutative=True):
        self.commutative = commutative

    def get_function(self):
        return np.subtract

    def generate_name(self, base_feature_names):
        return "%s - %s" % (base_feature_names[0], base_feature_names[1])


================================================
FILE: featuretools/primitives/standard/transform/binary/subtract_numeric_scalar.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive


class SubtractNumericScalar(TransformPrimitive):
    """Subtracts a scalar from each element in the list.

    Description:
        Given a list of numeric values and a scalar, subtract
        the given scalar from each value in the list.

    Examples:
        >>> subtract_numeric_scalar = SubtractNumericScalar(value=2)
        >>> subtract_numeric_scalar([3, 1, 2]).tolist()
        [1, -1, 0]
    """

    name = "subtract_numeric_scalar"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def __init__(self, value=0):
        self.value = value
        self.description_template = "the result of {{}} minus {}".format(self.value)

    def get_function(self):
        def subtract_scalar(vals):
            return vals - self.value

        return subtract_scalar

    def generate_name(self, base_feature_names):
        return "%s - %s" % (base_feature_names[0], str(self.value))


================================================
FILE: featuretools/primitives/standard/transform/cumulative/__init__.py
================================================
from featuretools.primitives.standard.transform.cumulative.cum_count import CumCount
from featuretools.primitives.standard.transform.cumulative.cum_max import CumMax
from featuretools.primitives.standard.transform.cumulative.cum_mean import CumMean
from featuretools.primitives.standard.transform.cumulative.cum_min import CumMin
from featuretools.primitives.standard.transform.cumulative.cum_sum import CumSum
from featuretools.primitives.standard.transform.cumulative.cumulative_time_since_last_false import (
    CumulativeTimeSinceLastFalse,
)
from featuretools.primitives.standard.transform.cumulative.cumulative_time_since_last_true import (
    CumulativeTimeSinceLastTrue,
)


================================================
FILE: featuretools/primitives/standard/transform/cumulative/cum_count.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable

from featuretools.primitives.base import TransformPrimitive


class CumCount(TransformPrimitive):
    """Calculates the cumulative count.

    Description:
        Given a list of values, return the cumulative count
        (or running count). There is no set window, so the
        count at each point is calculated over all prior
        values. `NaN` values are counted.

    Examples:
        >>> cum_count = CumCount()
        >>> cum_count([1, 2, 3, 4, None, 5]).tolist()
        [1, 2, 3, 4, 5, 6]
    """

    name = "cum_count"
    input_types = [
        [ColumnSchema(semantic_tags={"foreign_key"})],
        [ColumnSchema(semantic_tags={"category"})],
    ]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    uses_full_dataframe = True
    description_template = "the cumulative count of {}"

    def get_function(self):
        def cum_count(values):
            return np.arange(1, len(values) + 1)

        return cum_count


================================================
FILE: featuretools/primitives/standard/transform/cumulative/cum_max.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class CumMax(TransformPrimitive):
    """Calculates the cumulative maximum.

    Description:
        Given a list of values, return the cumulative max
        (or running max). There is no set window, so the max
        at each point is calculated over all prior values.
        `NaN` values will return `NaN`, but in the window of a
        cumulative calculation, they're ignored.

    Examples:
        >>> cum_max = CumMax()
        >>> cum_max([1, 2, 3, 4, None, 5]).tolist()
        [1.0, 2.0, 3.0, 4.0, nan, 5.0]
    """

    name = "cum_max"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_full_dataframe = True
    description_template = "the cumulative maximum of {}"

    def get_function(self):
        def cum_max(values):
            return values.cummax()

        return cum_max


================================================
FILE: featuretools/primitives/standard/transform/cumulative/cum_mean.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class CumMean(TransformPrimitive):
    """Calculates the cumulative mean.

    Description:
        Given a list of values, return the cumulative mean
        (or running mean). There is no set window, so the
        mean at each point is calculated over all prior values.
        `NaN` values will return `NaN`, but in the window of a
        cumulative calculation, they're treated as 0.

    Examples:
        >>> cum_mean = CumMean()
        >>> cum_mean([1, 2, 3, 4, None, 5]).tolist()
        [1.0, 1.5, 2.0, 2.5, nan, 2.5]
    """

    name = "cum_mean"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_full_dataframe = True
    description_template = "the cumulative mean of {}"

    def get_function(self):
        def cum_mean(values):
            return values.cumsum() / np.arange(1, len(values) + 1)

        return cum_mean
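
# Illustrative sketch (not part of the library): a dependency-free version of
# the running mean described in the CumMean docstring. The helper name
# `cum_mean_sketch` is hypothetical. It assumes missing values arrive as
# `None` or float NaN, and mirrors the documented behaviour: a missing value
# yields NaN at its own position, adds nothing to the running sum, but still
# counts toward the denominator of later positions.

```python
import math


def cum_mean_sketch(values):
    """Running mean over all prior values, with missing values treated as 0."""
    out = []
    running_sum = 0.0
    for position, value in enumerate(values, start=1):
        is_missing = value is None or (
            isinstance(value, float) and math.isnan(value)
        )
        if is_missing:
            # The missing position itself reports NaN.
            out.append(math.nan)
            continue
        running_sum += value
        # Divide by the 1-based position, so skipped values act like zeros.
        out.append(running_sum / position)
    return out


result = cum_mean_sketch([1, 2, 3, 4, None, 5])
```

# The last entry is 15 / 6 = 2.5 rather than 15 / 5 = 3.0, which is why the
# docstring says missing values are "treated as 0" inside the window.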


================================================
FILE: featuretools/primitives/standard/transform/cumulative/cum_min.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class CumMin(TransformPrimitive):
    """Calculates the cumulative minimum.

    Description:
        Given a list of values, return the cumulative min
        (or running min). There is no set window, so the min
        at each point is calculated over all prior values.
        `NaN` values will return `NaN`, but in the window of a
        cumulative calculation, they're ignored.

    Examples:
        >>> cum_min = CumMin()
        >>> cum_min([1, 2, -3, 4, None, 5]).tolist()
        [1.0, 1.0, -3.0, -3.0, nan, -3.0]
    """

    name = "cum_min"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_full_dataframe = True
    description_template = "the cumulative minimum of {}"

    def get_function(self):
        def cum_min(values):
            return values.cummin()

        return cum_min


================================================
FILE: featuretools/primitives/standard/transform/cumulative/cum_sum.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class CumSum(TransformPrimitive):
    """Calculates the cumulative sum.

    Description:
        Given a list of values, return the cumulative sum
        (or running total). There is no set window, so the
        sum at each point is calculated over all prior values.
        `NaN` values will return `NaN`, but in the window of a
        cumulative calculation, they're ignored.

    Examples:
        >>> cum_sum = CumSum()
        >>> cum_sum([1, 2, 3, 4, None, 5]).tolist()
        [1.0, 3.0, 6.0, 10.0, nan, 15.0]
    """

    name = "cum_sum"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_full_dataframe = True
    description_template = "the cumulative sum of {}"

    def get_function(self):
        def cum_sum(values):
            return values.cumsum()

        return cum_sum


================================================
FILE: featuretools/primitives/standard/transform/cumulative/cumulative_time_since_last_false.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, Datetime, Double

from featuretools.primitives.base import TransformPrimitive


class CumulativeTimeSinceLastFalse(TransformPrimitive):
    """Determines the time since last `False` value.

    Description:
        Given a list of booleans and a list of corresponding
        datetimes, determine the time at each point since the
        last `False` value. Returns time difference in seconds.
        `NaN` values are ignored.

    Examples:
        >>> from datetime import datetime
        >>> cumulative_time_since_last_false = CumulativeTimeSinceLastFalse()
        >>> booleans = [False, True, False, True]
        >>> datetimes = [
        ...     datetime(2011, 4, 9, 10, 30, 0),
        ...     datetime(2011, 4, 9, 10, 30, 10),
        ...     datetime(2011, 4, 9, 10, 30, 15),
        ...     datetime(2011, 4, 9, 10, 30, 29)
        ... ]
        >>> cumulative_time_since_last_false(datetimes, booleans).tolist()
        [0.0, 10.0, 0.0, 14.0]
    """

    name = "cumulative_time_since_last_false"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(logical_type=Boolean),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    def get_function(self):
        def time_since_previous_false(datetime_col, bool_col):
            if bool_col.dropna().empty:
                return pd.Series([np.nan] * len(bool_col))
            df = pd.DataFrame(
                {
                    "datetime": datetime_col,
                    "last_false_datetime": datetime_col,
                    "bool": bool_col,
                },
            )
            not_false_indices = df["bool"]
            df.loc[not_false_indices, "last_false_datetime"] = np.nan
            # Series.ffill() replaces the deprecated fillna(method="ffill").
            df["last_false_datetime"] = df["last_false_datetime"].ffill()
            total_seconds = (
                pd.to_datetime(df["datetime"]).subtract(df["last_false_datetime"])
            ).dt.total_seconds()
            return pd.Series(total_seconds)

        return time_since_previous_false
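
# Illustrative sketch (not part of the library): the mask-and-forward-fill
# logic above, restated as a plain loop. The helper name
# `time_since_last_false_sketch` is hypothetical, and the sketch assumes the
# boolean list contains no missing values and that both lists are already in
# time order.

```python
from datetime import datetime


def time_since_last_false_sketch(datetimes, booleans):
    """Seconds elapsed at each row since the most recent `False` row."""
    out = []
    last_false = None  # timestamp of the latest False seen so far
    for moment, flag in zip(datetimes, booleans):
        if flag is False:
            last_false = moment
        if last_false is None:
            # No False has occurred yet, so there is nothing to measure from.
            out.append(float("nan"))
        else:
            out.append((moment - last_false).total_seconds())
    return out


datetimes = [
    datetime(2011, 4, 9, 10, 30, 0),
    datetime(2011, 4, 9, 10, 30, 10),
    datetime(2011, 4, 9, 10, 30, 15),
    datetime(2011, 4, 9, 10, 30, 29),
]
seconds = time_since_last_false_sketch(datetimes, [False, True, False, True])
```

# With the docstring's inputs this gives 0, 10, 0 and 14 seconds. A leading
# `True` row would produce NaN because no `False` precedes it.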


================================================
FILE: featuretools/primitives/standard/transform/cumulative/cumulative_time_since_last_true.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, Datetime, Double

from featuretools.primitives.base import TransformPrimitive


class CumulativeTimeSinceLastTrue(TransformPrimitive):
    """Determines the time (in seconds) since the last boolean was `True`
    given a datetime index column and boolean column.

    Examples:
        >>> from datetime import datetime
        >>> cumulative_time_since_last_true = CumulativeTimeSinceLastTrue()
        >>> booleans = [False, True, False, True]
        >>> datetimes = [
        ...     datetime(2011, 4, 9, 10, 30, 0),
        ...     datetime(2011, 4, 9, 10, 30, 10),
        ...     datetime(2011, 4, 9, 10, 30, 15),
        ...     datetime(2011, 4, 9, 10, 30, 30)
        ... ]
        >>> cumulative_time_since_last_true(datetimes, booleans).tolist()
        [nan, 0.0, 5.0, 0.0]
    """

    name = "cumulative_time_since_last_true"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(logical_type=Boolean),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    def get_function(self):
        def time_since_previous_true(datetime_col, bool_col):
            if bool_col.dropna().empty:
                return pd.Series([np.nan] * len(bool_col))
            df = pd.DataFrame(
                {
                    "datetime": datetime_col,
                    "last_true_datetime": datetime_col,
                    "bool": bool_col,
                },
            )
            not_false_indices = df["bool"]
            df.loc[~not_false_indices, "last_true_datetime"] = np.nan
            # Series.ffill() replaces the deprecated fillna(method="ffill").
            df["last_true_datetime"] = df["last_true_datetime"].ffill()
            total_seconds = (
                pd.to_datetime(df["datetime"]).subtract(df["last_true_datetime"])
            ).dt.total_seconds()
            return pd.Series(total_seconds)

        return time_since_previous_true


================================================
FILE: featuretools/primitives/standard/transform/datetime/__init__.py
================================================
from featuretools.primitives.standard.transform.datetime.age import Age
from featuretools.primitives.standard.transform.datetime.date_to_holiday import (
    DateToHoliday,
)
from featuretools.primitives.standard.transform.datetime.date_to_timezone import (
    DateToTimeZone,
)
from featuretools.primitives.standard.transform.datetime.day import Day
from featuretools.primitives.standard.transform.datetime.day_of_year import DayOfYear
from featuretools.primitives.standard.transform.datetime.days_in_month import (
    DaysInMonth,
)
from featuretools.primitives.standard.transform.datetime.diff_datetime import (
    DiffDatetime,
)
from featuretools.primitives.standard.transform.datetime.distance_to_holiday import (
    DistanceToHoliday,
)
from featuretools.primitives.standard.transform.datetime.hour import Hour
from featuretools.primitives.standard.transform.datetime.is_first_week_of_month import (
    IsFirstWeekOfMonth,
)
from featuretools.primitives.standard.transform.datetime.is_federal_holiday import (
    IsFederalHoliday,
)
from featuretools.primitives.standard.transform.datetime.is_leap_year import IsLeapYear
from featuretools.primitives.standard.transform.datetime.is_lunch_time import (
    IsLunchTime,
)
from featuretools.primitives.standard.transform.datetime.is_month_end import IsMonthEnd
from featuretools.primitives.standard.transform.datetime.is_month_start import (
    IsMonthStart,
)
from featuretools.primitives.standard.transform.datetime.is_quarter_end import (
    IsQuarterEnd,
)
from featuretools.primitives.standard.transform.datetime.is_quarter_start import (
    IsQuarterStart,
)
from featuretools.primitives.standard.transform.datetime.is_weekend import IsWeekend
from featuretools.primitives.standard.transform.datetime.is_working_hours import (
    IsWorkingHours,
)
from featuretools.primitives.standard.transform.datetime.is_year_end import IsYearEnd
from featuretools.primitives.standard.transform.datetime.is_year_start import (
    IsYearStart,
)
from featuretools.primitives.standard.transform.datetime.minute import Minute
from featuretools.primitives.standard.transform.datetime.month import Month
from featuretools.primitives.standard.transform.datetime.part_of_day import PartOfDay
from featuretools.primitives.standard.transform.datetime.quarter import Quarter
from featuretools.primitives.standard.transform.datetime.season import Season
from featuretools.primitives.standard.transform.datetime.second import Second
from featuretools.primitives.standard.transform.datetime.time_since import TimeSince
from featuretools.primitives.standard.transform.datetime.time_since_previous import (
    TimeSincePrevious,
)
from featuretools.primitives.standard.transform.datetime.week import Week
from featuretools.primitives.standard.transform.datetime.weekday import Weekday
from featuretools.primitives.standard.transform.datetime.year import Year


================================================
FILE: featuretools/primitives/standard/transform/datetime/age.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import AgeFractional, Datetime

from featuretools.primitives.base import TransformPrimitive


class Age(TransformPrimitive):
    """Calculates the age in years as a floating point number given a
       date of birth.

    Description:
        Age in years is computed by calculating the number of days between
        the date of birth and the reference time and dividing the result
        by 365.

    Examples:
        Determine the age of three people as of Jan 1, 2019
        >>> import pandas as pd
        >>> reference_date = pd.to_datetime("01-01-2019")
        >>> age = Age()
        >>> input_ages = [pd.to_datetime("01-01-2000"),
        ...               pd.to_datetime("05-30-1983"),
        ...               pd.to_datetime("10-17-1997")]
        >>> age(input_ages, time=reference_date).tolist()
        [19.013698630136986, 35.61643835616438, 21.221917808219178]
    """

    name = "age"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"date_of_birth"})]
    return_type = ColumnSchema(logical_type=AgeFractional, semantic_tags={"numeric"})
    uses_calc_time = True
    description_template = "the age from {}"

    def get_function(self):
        def age(x, time=None):
            return (time - x).dt.days / 365

        return age
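
# Illustrative sketch (not part of the library): the age arithmetic from the
# docstring using only the standard library. The helper name `age_sketch` is
# hypothetical. It follows the same rule as the primitive: whole days between
# the date of birth and the reference time, divided by a fixed 365.

```python
from datetime import datetime


def age_sketch(birth_dates, time):
    """Fractional age in years: elapsed whole days divided by 365."""
    return [(time - born).days / 365 for born in birth_dates]


reference_date = datetime(2019, 1, 1)
ages = age_sketch(
    [datetime(2000, 1, 1), datetime(1983, 5, 30), datetime(1997, 10, 17)],
    time=reference_date,
)
elapsed_days = (reference_date - datetime(2000, 1, 1)).days
```

# Because the divisor is a fixed 365, leap days are not averaged out: someone
# born on Jan 1, 2000 has lived 6940 days by Jan 1, 2019, which is slightly
# more than 19.0 "years" under this definition.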


================================================
FILE: featuretools/primitives/standard/transform/datetime/date_to_holiday.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, Datetime

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.datetime.utils import HolidayUtil


class DateToHoliday(TransformPrimitive):
    """Transforms time of an instance into the holiday name, if there is one.

    Description:
        If there is no holiday, it returns `NaN`. Currently only works for the
        United States and Canada with dates between 1950 and 2100.

    Args:
        country (str): Country to use for determining Holidays.
            Default is 'US'. Should be one of the available countries here:
            https://github.com/dr-prodigy/python-holidays#available-countries

    Examples:
        >>> from datetime import datetime
        >>> date_to_holiday = DateToHoliday()
        >>> dates = pd.Series([datetime(2016, 1, 1),
        ...          datetime(2016, 2, 27),
        ...          datetime(2017, 5, 29, 10, 30, 5),
        ...          datetime(2018, 7, 4)])
        >>> date_to_holiday(dates).tolist()
        ["New Year's Day", nan, 'Memorial Day', 'Independence Day']

        We can also change the country.

        >>> date_to_holiday_canada = DateToHoliday(country='Canada')
        >>> dates = pd.Series([datetime(2016, 7, 1),
        ...          datetime(2016, 11, 15),
        ...          datetime(2018, 12, 25)])
        >>> date_to_holiday_canada(dates).tolist()
        ['Canada Day', nan, 'Christmas Day']
    """

    name = "date_to_holiday"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def __init__(self, country="US"):
        self.country = country
        self.holidayUtil = HolidayUtil(country)

    def get_function(self):
        def date_to_holiday(x):
            holiday_df = self.holidayUtil.to_df()
            df = pd.DataFrame({"date": x})
            df["date"] = df["date"].dt.date.astype("datetime64[ns]")

            df = df.merge(
                holiday_df,
                how="left",
                left_on="date",
                right_on="holiday_date",
            )
            return df.names.values

        return date_to_holiday


================================================
FILE: featuretools/primitives/standard/transform/datetime/date_to_timezone.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, Datetime

from featuretools.primitives.base import TransformPrimitive


class DateToTimeZone(TransformPrimitive):
    """Determines the timezone of a datetime.

    Description:
        Given a list of datetimes, extract the timezone from each
        one. Looks for the `tzinfo` attribute on `datetime.datetime`
        objects. If the datetime has no timezone or the date is
        missing, return `NaN`.

    Examples:
        >>> from datetime import datetime
        >>> from pytz import timezone
        >>> date_to_time_zone = DateToTimeZone()
        >>> dates = [datetime(2010, 1, 1, tzinfo=timezone("America/Los_Angeles")),
        ...          datetime(2010, 1, 1, tzinfo=timezone("America/New_York")),
        ...          datetime(2010, 1, 1, tzinfo=timezone("America/Chicago")),
        ...          datetime(2010, 1, 1)]
        >>> date_to_time_zone(dates).tolist()
        ['America/Los_Angeles', 'America/New_York', 'America/Chicago', nan]
    """

    name = "date_to_time_zone"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        def date_to_time_zone(x):
            return x.apply(lambda x: x.tzinfo.zone if x.tzinfo else np.nan)

        return date_to_time_zone


================================================
FILE: featuretools/primitives/standard/transform/datetime/day.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Day(TransformPrimitive):
    """Determines the day of the month from a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3),
        ...          datetime(2019, 3, 31)]
        >>> day = Day()
        >>> day(dates).tolist()
        [1, 3, 31]
    """

    name = "day"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 32))),
        semantic_tags={"category"},
    )

    description_template = "the day of the month of {}"

    def get_function(self):
        def day(vals):
            return vals.dt.day

        return day


================================================
FILE: featuretools/primitives/standard/transform/datetime/day_of_year.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class DayOfYear(TransformPrimitive):
    """Determines the ordinal day of the year from the given datetime.

    Description:
        For a list of dates, return the ordinal day of the year
        from the given datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 1, 1),
        ...          datetime(2020, 12, 31),
        ...          datetime(2020, 2, 28)]
        >>> dayOfYear = DayOfYear()
        >>> dayOfYear(dates).tolist()
        [1, 366, 59]
    """

    name = "day_of_year"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 367))),
        semantic_tags={"category"},
    )

    description_template = "the day of year from {}"

    def get_function(self):
        def dayOfYear(vals):
            return vals.dt.dayofyear

        return dayOfYear


================================================
FILE: featuretools/primitives/standard/transform/datetime/days_in_month.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class DaysInMonth(TransformPrimitive):
    """Determines the number of days in the month of a given datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 12, 1),
        ...          datetime(2019, 1, 3),
        ...          datetime(2020, 2, 1)]
        >>> days_in_month = DaysInMonth()
        >>> days_in_month(dates).tolist()
        [31, 31, 29]
    """

    name = "days_in_month"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 32))),
        semantic_tags={"category"},
    )

    description_template = "the days in the month of {}"

    def get_function(self):
        def days_in_month(vals):
            return vals.dt.daysinmonth

        return days_in_month


================================================
FILE: featuretools/primitives/standard/transform/datetime/diff_datetime.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Timedelta

from featuretools.primitives.standard.transform.numeric.diff import Diff


class DiffDatetime(Diff):
    """Computes the timedelta between a datetime in a list and the
    previous datetime in that list.

    Args:
        periods (int): The number of periods by which to shift the index row.
            Default is 0. Periods correspond to rows.

    Description:
        Given a list of datetimes, compute the difference from the previous
        item in the list. The result for the first element of the list will
        always be `NaT`.

    Examples:
        >>> from datetime import datetime
        >>> dt_values = [datetime(2019, 3, 1), datetime(2019, 6, 30), datetime(2019, 11, 17), datetime(2020, 1, 30), datetime(2020, 3, 11)]
        >>> diff_dt = DiffDatetime()
        >>> diff_dt(dt_values).tolist()
        [NaT, Timedelta('121 days 00:00:00'), Timedelta('140 days 00:00:00'), Timedelta('74 days 00:00:00'), Timedelta('41 days 00:00:00')]

        You can specify the number of periods to shift the values.

        >>> diff_dt_periods = DiffDatetime(periods=1)
        >>> diff_dt_periods(dt_values).tolist()
        [NaT, NaT, Timedelta('121 days 00:00:00'), Timedelta('140 days 00:00:00'), Timedelta('74 days 00:00:00')]
    """

    name = "diff_datetime"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Timedelta)
    uses_full_dataframe = True
    description_template = "the difference from the previous value of {}"

    def __init__(self, periods=0):
        super().__init__(periods)


================================================
FILE: featuretools/primitives/standard/transform/datetime/distance_to_holiday.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.datetime.utils import HolidayUtil


class DistanceToHoliday(TransformPrimitive):
    """Computes the number of days before or after a given holiday.

    Description:
        For a list of dates, return the distance from the nearest
        occurrence of a chosen holiday. The distance is returned in
        days. If the closest occurrence is prior to the date given,
        return a negative number.

        If a date is missing, return `NaN`.

        Currently only works with dates between 1950 and 2100.

    Args:
        holiday (str): Name of the holiday. Defaults to New Year's Day.

        country (str): Specifies which country's calendar to use for the
            given holiday. Default is `US`.

    Examples:
        >>> from datetime import datetime
        >>> distance_to_holiday = DistanceToHoliday("New Year's Day")
        >>> dates = [datetime(2010, 1, 1),
        ...          datetime(2012, 5, 31),
        ...          datetime(2017, 7, 31),
        ...          datetime(2020, 12, 31)]
        >>> distance_to_holiday(dates).tolist()
        [0, -151, 154, 1]

        We can also control the country in which we're searching for
        a holiday.

        >>> distance_to_holiday = DistanceToHoliday("Canada Day", country='Canada')
        >>> dates = [datetime(2010, 1, 1),
        ...          datetime(2012, 5, 31),
        ...          datetime(2017, 7, 31),
        ...          datetime(2020, 12, 31)]
        >>> distance_to_holiday(dates).tolist()
        [181, 31, -30, 182]
    """

    name = "distance_to_holiday"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    default_value = 0

    def __init__(self, holiday="New Year's Day", country="US"):
        self.country = country
        self.holiday = holiday
        self.holidayUtil = HolidayUtil(country)

        available_holidays = list(set(self.holidayUtil.federal_holidays.values()))
        if self.holiday not in available_holidays:
            error = "must be one of the available holidays:\n%s" % available_holidays
            raise ValueError(error)

    def get_function(self):
        def distance_to_holiday(x):
            holiday_df = self.holidayUtil.to_df()
            holiday_df = holiday_df[holiday_df.names == self.holiday]

            df = pd.DataFrame({"date": x})
            df["x_index"] = df.index  # store original index as a column
            df = df.dropna()
            df = df.sort_values("date")
            df["date"] = df["date"].dt.date.astype("datetime64[ns]")

            matches = pd.merge_asof(
                df,
                holiday_df,
                left_on="date",
                right_on="holiday_date",
                direction="nearest",
                tolerance=pd.Timedelta("365d"),
            )
            matches = matches.set_index("x_index")
            matches["days_diff"] = (matches.holiday_date - matches.date).dt.days

            return matches.days_diff.reindex_like(x)

        return distance_to_holiday
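
# Illustrative sketch (not part of the library): the "nearest occurrence"
# rule from the docstring, reduced to the standard library. The helper name
# `distance_to_new_year_sketch` is hypothetical. It assumes the holiday is
# always January 1 and ignores observed-date shifts, whereas the primitive
# looks holidays up through HolidayUtil. Ties are resolved in favour of the
# earlier occurrence here, which may differ from `pd.merge_asof`.

```python
from datetime import date


def distance_to_new_year_sketch(day):
    """Signed days to the nearest January 1 (negative if it already passed)."""
    candidates = [date(day.year, 1, 1), date(day.year + 1, 1, 1)]
    # Holiday minus date: past occurrences are negative, upcoming are positive.
    diffs = [(holiday - day).days for holiday in candidates]
    return min(diffs, key=abs)


distances = [
    distance_to_new_year_sketch(d)
    for d in [
        date(2010, 1, 1),
        date(2012, 5, 31),
        date(2017, 7, 31),
        date(2020, 12, 31),
    ]
]
```

# For the docstring's dates this yields 0, -151, 154 and 1: May 31, 2012 is
# closer to the New Year's Day that already passed, while July 31, 2017 is
# closer to the upcoming one.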


================================================
FILE: featuretools/primitives/standard/transform/datetime/hour.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Hour(TransformPrimitive):
    """Determines the hour value of a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3, 11, 10, 50),
        ...          datetime(2019, 3, 31, 19, 45, 15)]
        >>> hour = Hour()
        >>> hour(dates).tolist()
        [0, 11, 19]
    """

    name = "hour"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(24))),
        semantic_tags={"category"},
    )

    description_template = "the hour value of {}"

    def get_function(self):
        def hour(vals):
            return vals.dt.hour

        return hour


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_federal_holiday.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.datetime.utils import HolidayUtil


class IsFederalHoliday(TransformPrimitive):
    """Determines if a given datetime is a federal holiday.

    Description:
        This primitive currently only works for the United States
        and Canada with dates between 1950 and 2074.

    Args:
        country (str): Country to use for determining Holidays.
            Default is 'US'. Should be one of the available countries here:
            https://github.com/dr-prodigy/python-holidays#available-countries

    Examples:
        >>> from datetime import datetime
        >>> is_federal_holiday = IsFederalHoliday(country="US")
        >>> is_federal_holiday([
        ...     datetime(2019, 7, 4, 10, 0, 30),
        ...     datetime(2019, 2, 26)]).tolist()
        [True, False]
    """

    name = "is_federal_holiday"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def __init__(self, country="US"):
        self.country = country
        self.holidayUtil = HolidayUtil(country)

    def get_function(self):
        def is_federal_holiday(x):
            holidays_df = self.holidayUtil.to_df()
            is_holiday = x.dt.normalize().isin(holidays_df.holiday_date)
            if x.isnull().values.any():
                is_holiday = is_holiday.astype("object")
                is_holiday[x.isnull()] = np.nan
            return is_holiday.values

        return is_federal_holiday


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_first_week_of_month.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsFirstWeekOfMonth(TransformPrimitive):
    """Determines if a date falls in the first week of the month.

    Description:
        Converts a datetime to a boolean indicating if the date
        falls in the first week of the month. The first week of
        the month starts on day 1, and the week number is incremented
        each Sunday.

    Examples:
        >>> from datetime import datetime
        >>> is_first_week_of_month = IsFirstWeekOfMonth()
        >>> times = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3),
        ...          datetime(2019, 3, 31),
        ...          datetime(2019, 3, 30)]
        >>> is_first_week_of_month(times).tolist()
        [True, False, False, False]
    """

    name = "is_first_week_of_month"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def get_function(self):
        def is_first_week_of_month(x):
            df = pd.DataFrame({"date": x})
            df["first_day"] = df.date - pd.to_timedelta(df["date"].dt.day - 1, unit="d")
            df["dom"] = df.date.dt.day
            df["first_day_weekday"] = df.first_day.dt.weekday
            df["adjusted_dom"] = df.dom + df.first_day_weekday + 1
            df.loc[df["first_day_weekday"].astype(float) == 6.0, "adjusted_dom"] = df[
                "dom"
            ]
            df["is_first_week"] = np.ceil(df.adjusted_dom / 7.0) == 1.0
            if df["date"].isnull().values.any():
                df["is_first_week"] = df["is_first_week"].astype("object")
                df.loc[df["date"].isnull(), "is_first_week"] = np.nan
            return df.is_first_week.values

        return is_first_week_of_month


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_leap_year.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsLeapYear(TransformPrimitive):
    """Determines the is_leap_year attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2020, 3, 3, 11, 10, 50),
        ...          datetime(2021, 3, 31, 19, 45, 15)]
        >>> ily = IsLeapYear()
        >>> ily(dates).tolist()
        [False, True, False]
    """

    name = "is_leap_year"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether the year of {} is a leap year"

    def get_function(self):
        def is_leap_year(vals):
            return vals.dt.is_leap_year

        return is_leap_year


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_lunch_time.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsLunchTime(TransformPrimitive):
    """Determines if a datetime falls during a configurable lunch hour, on a 24-hour clock.

    Args:
        lunch_hour (int): Hour when lunch is taken. Must adhere to 24-hour clock. Defaults to 12.

    Examples:
        >>> import numpy as np
        >>> from datetime import datetime
        >>> dates = [datetime(2022, 6, 21, 12, 3, 3),
        ...          datetime(2019, 1, 3, 4, 4, 4),
        ...          datetime(2022, 1, 1, 11, 1, 2),
        ...          np.nan]
        >>> is_lunch_time = IsLunchTime()
        >>> is_lunch_time(dates).tolist()
        [True, False, False, False]
        >>> is_lunch_time = IsLunchTime(11)
        >>> is_lunch_time(dates).tolist()
        [False, False, True, False]
    """

    name = "is_lunch_time"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} falls during lunch time"

    def __init__(self, lunch_hour=12):
        self.lunch_hour = lunch_hour

    def get_function(self):
        def is_lunch_time(vals):
            return vals.dt.hour == self.lunch_hour

        return is_lunch_time


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_month_end.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsMonthEnd(TransformPrimitive):
    """Determines the is_month_end attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2021, 2, 28),
        ...          datetime(2020, 2, 29)]
        >>> ime = IsMonthEnd()
        >>> ime(dates).tolist()
        [False, True, True]
    """

    name = "is_month_end"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is at the end of a month"

    def get_function(self):
        def is_month_end(vals):
            return vals.dt.is_month_end

        return is_month_end


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_month_start.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsMonthStart(TransformPrimitive):
    """Determines the is_month_start attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2020, 2, 13),
        ...          datetime(2020, 2, 29)]
        >>> ims = IsMonthStart()
        >>> ims(dates).tolist()
        [True, False, False]
    """

    name = "is_month_start"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is at the start of a month"

    def get_function(self):
        def is_month_start(vals):
            return vals.dt.is_month_start

        return is_month_start


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_quarter_end.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsQuarterEnd(TransformPrimitive):
    """Determines the is_quarter_end attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> iqe = IsQuarterEnd()
        >>> dates = [datetime(2020, 3, 31),
        ...          datetime(2020, 1, 1)]
        >>> iqe(dates).tolist()
        [True, False]
    """

    name = "is_quarter_end"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is a quarter end"

    def get_function(self):
        def is_quarter_end(vals):
            return vals.dt.is_quarter_end

        return is_quarter_end


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_quarter_start.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsQuarterStart(TransformPrimitive):
    """Determines the is_quarter_start attribute of a datetime column.

    Examples:
        >>> from datetime import datetime
        >>> iqs = IsQuarterStart()
        >>> dates = [datetime(2020, 3, 31),
        ...          datetime(2020, 1, 1)]
        >>> iqs(dates).tolist()
        [False, True]
    """

    name = "is_quarter_start"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} is a quarter start"

    def get_function(self):
        def is_quarter_start(vals):
            return vals.dt.is_quarter_start

        return is_quarter_start


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_weekend.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsWeekend(TransformPrimitive):
    """Determines if a date falls on a weekend.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 6, 17, 11, 10, 50),
        ...          datetime(2019, 11, 30, 19, 45, 15)]
        >>> is_weekend = IsWeekend()
        >>> is_weekend(dates).tolist()
        [False, False, True]
    """

    name = "is_weekend"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} occurred on a weekend"

    def get_function(self):
        def is_weekend(vals):
            return vals.dt.weekday > 4

        return is_weekend


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_working_hours.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsWorkingHours(TransformPrimitive):
    """Determines if a datetime falls during working hours on a 24-hour clock. Can configure start_hour and end_hour.

    Args:
        start_hour (int): Start hour of workday. Must adhere to 24-hour clock. Default is 8 (8am).
        end_hour (int): End hour of workday. Must adhere to 24-hour clock. Default is 18 (6pm).

    Examples:
        >>> import numpy as np
        >>> from datetime import datetime
        >>> dates = [datetime(2022, 6, 21, 16, 3, 3),
        ...          datetime(2019, 1, 3, 4, 4, 4),
        ...          datetime(2022, 1, 1, 12, 1, 2),
        ...          np.nan]
        >>> is_working_hour = IsWorkingHours()
        >>> is_working_hour(dates).tolist()
        [True, False, True, False]
        >>> is_working_hour = IsWorkingHours(15, 17)
        >>> is_working_hour(dates).tolist()
        [True, False, False, False]
    """

    name = "is_working_hours"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} falls during working hours"

    def __init__(self, start_hour=8, end_hour=18):
        self.start_hour = start_hour
        self.end_hour = end_hour

    def get_function(self):
        def is_working_hours(vals):
            return (vals.dt.hour >= self.start_hour) & (vals.dt.hour <= self.end_hour)

        return is_working_hours


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_year_end.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsYearEnd(TransformPrimitive):
    """Determines if a date falls on the end of a year.

    Examples:
        >>> import numpy as np
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 12, 31),
        ...          datetime(2019, 1, 1),
        ...          datetime(2019, 11, 30),
        ...          np.nan]
        >>> is_year_end = IsYearEnd()
        >>> is_year_end(dates).tolist()
        [True, False, False, False]
    """

    name = "is_year_end"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} occurred on the end of a year"

    def get_function(self):
        def is_year_end(vals):
            return vals.dt.is_year_end

        return is_year_end


================================================
FILE: featuretools/primitives/standard/transform/datetime/is_year_start.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, Datetime

from featuretools.primitives.base import TransformPrimitive


class IsYearStart(TransformPrimitive):
    """Determines if a date falls on the start of a year.

    Examples:
        >>> import numpy as np
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 12, 31),
        ...          datetime(2019, 1, 1),
        ...          datetime(2019, 11, 30),
        ...          np.nan]
        >>> is_year_start = IsYearStart()
        >>> is_year_start(dates).tolist()
        [False, True, False, False]
    """

    name = "is_year_start"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    description_template = "whether {} occurred on the start of a year"

    def get_function(self):
        def is_year_start(vals):
            return vals.dt.is_year_start

        return is_year_start


================================================
FILE: featuretools/primitives/standard/transform/datetime/minute.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Minute(TransformPrimitive):
    """Determines the minutes value of a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3, 11, 10, 50),
        ...          datetime(2019, 3, 31, 19, 45, 15)]
        >>> minute = Minute()
        >>> minute(dates).tolist()
        [0, 10, 45]
    """

    name = "minute"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(60))),
        semantic_tags={"category"},
    )

    description_template = "the minutes value of {}"

    def get_function(self):
        def minute(vals):
            return vals.dt.minute

        return minute


================================================
FILE: featuretools/primitives/standard/transform/datetime/month.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Month(TransformPrimitive):
    """Determines the month value of a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 6, 17, 11, 10, 50),
        ...          datetime(2019, 11, 30, 19, 45, 15)]
        >>> month = Month()
        >>> month(dates).tolist()
        [3, 6, 11]
    """

    name = "month"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 13))),
        semantic_tags={"category"},
    )

    description_template = "the month of {}"

    def get_function(self):
        def month(vals):
            return vals.dt.month

        return month


================================================
FILE: featuretools/primitives/standard/transform/datetime/part_of_day.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, Datetime

from featuretools.primitives.base import TransformPrimitive


class PartOfDay(TransformPrimitive):
    """Determines the part of day of a datetime.

    Description:
        For a list of datetimes, determines the part of day the datetime
        falls into, based on the hour.
        If the hour falls from 4 to 5, the part of day is 'dawn'.
        If the hour falls from 6 to 7, the part of day is 'early morning'.
        If the hour falls from 8 to 10, the part of day is 'late morning'.
        If the hour falls from 11 to 13, the part of day is 'noon'.
        If the hour falls from 14 to 16, the part of day is 'afternoon'.
        If the hour falls from 17 to 19, the part of day is 'evening'.
        If the hour falls from 20 to 22, the part of day is 'night'.
        If the hour falls into 23, 0, or 1 to 3, the part of day is 'midnight'.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2020, 1, 11, 6, 2, 1),
        ...          datetime(2021, 3, 31, 4, 2, 1),
        ...          datetime(2020, 3, 4, 9, 2, 1)]
        >>> part_of_day = PartOfDay()
        >>> part_of_day(dates).tolist()
        ['early morning', 'dawn', 'late morning']
    """

    name = "part_of_day"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    description_template = "the part of day {} falls in"

    @staticmethod
    def construct_replacement_dict():
        tdict = dict()
        tdict[pd.NaT] = np.nan
        for hour in [4, 5]:
            tdict[hour] = "dawn"
        for hour in [6, 7]:
            tdict[hour] = "early morning"
        for hour in [8, 9, 10]:
            tdict[hour] = "late morning"
        for hour in [11, 12, 13]:
            tdict[hour] = "noon"
        for hour in [14, 15, 16]:
            tdict[hour] = "afternoon"
        for hour in [17, 18, 19]:
            tdict[hour] = "evening"
        for hour in [20, 21, 22]:
            tdict[hour] = "night"
        for hour in [23, 0, 1, 2, 3]:
            tdict[hour] = "midnight"
        return tdict

    def get_function(self):
        replacement_dict = self.construct_replacement_dict()

        def part_of_day(vals):
            ans = vals.dt.hour.replace(replacement_dict)
            return ans

        return part_of_day


================================================
FILE: featuretools/primitives/standard/transform/datetime/quarter.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Quarter(TransformPrimitive):
    """Determines the quarter a datetime column falls into (1, 2, 3, 4).

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019,12,1),
        ...          datetime(2019,1,3),
        ...          datetime(2020,2,1)]
        >>> q = Quarter()
        >>> q(dates).tolist()
        [4, 1, 1]
    """

    name = "quarter"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 5))),
        semantic_tags={"category"},
    )

    description_template = "the quarter that describes {}"

    def get_function(self):
        def quarter(vals):
            return vals.dt.quarter

        return quarter


================================================
FILE: featuretools/primitives/standard/transform/datetime/season.py
================================================
from datetime import date

import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, Datetime

from featuretools.primitives.base import TransformPrimitive


class Season(TransformPrimitive):
    """Determines the season of a given datetime.
        Returns winter, spring, summer, or fall.
        This only works for the northern hemisphere.

    Description:
        Given a list of datetimes, return the season of each one
        (`winter`, `spring`, `summer`, or `fall`).

    Examples:
        >>> from datetime import datetime
        >>> times = [datetime(2019, 1, 1),
        ...          datetime(2019, 4, 15),
        ...          datetime(2019, 7, 20),
        ...          datetime(2019, 12, 30)]
        >>> season = Season()
        >>> season(times).tolist()
        ['winter', 'spring', 'summer', 'winter']
    """

    name = "season"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        def season(x):
            # https://stackoverflow.com/a/28688724/2512385
            Y = 2000  # dummy leap year to allow input X-02-29 (leap day)
            seasons = [
                ("winter", (date(Y, 1, 1), date(Y, 3, 20))),
                ("spring", (date(Y, 3, 21), date(Y, 6, 20))),
                ("summer", (date(Y, 6, 21), date(Y, 9, 22))),
                ("fall", (date(Y, 9, 23), date(Y, 12, 20))),
                ("winter", (date(Y, 12, 21), date(Y, 12, 31))),
            ]
            x = x.apply(lambda dt: dt.replace(year=Y))

            def get_season(dt):
                for season, (start, end) in seasons:
                    if not pd.isna(dt) and start <= dt.date() <= end:
                        return season
                return pd.NA

            new = x.apply(get_season).astype(dtype="string")
            return new

        return season


================================================
FILE: featuretools/primitives/standard/transform/datetime/second.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Second(TransformPrimitive):
    """Determines the seconds value of a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3, 11, 10, 50),
        ...          datetime(2019, 3, 31, 19, 45, 15)]
        >>> second = Second()
        >>> second(dates).tolist()
        [0, 50, 15]
    """

    name = "second"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(60))),
        semantic_tags={"category"},
    )

    description_template = "the seconds value of {}"

    def get_function(self):
        def second(vals):
            return vals.dt.second

        return second


================================================
FILE: featuretools/primitives/standard/transform/datetime/time_since.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base import TransformPrimitive
from featuretools.utils import convert_time_units


class TimeSince(TransformPrimitive):
    """Calculates time from a value to a specified cutoff datetime.

    Args:
        unit (str): Defines the unit of time to count from.
            Defaults to seconds. Acceptable values:
            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds

    Examples:
        >>> from datetime import datetime
        >>> time_since = TimeSince()
        >>> times = [datetime(2019, 3, 1, 0, 0, 0, 1),
        ...          datetime(2019, 3, 1, 0, 0, 1, 0),
        ...          datetime(2019, 3, 1, 0, 2, 0, 0)]
        >>> cutoff_time = datetime(2019, 3, 1, 0, 0, 0, 0)
        >>> values = time_since(times, time=cutoff_time)
        >>> list(map(int, values))
        [0, -1, -120]

        Change output to nanoseconds

        >>> from datetime import datetime
        >>> time_since_nano = TimeSince(unit='nanoseconds')
        >>> times = [datetime(2019, 3, 1, 0, 0, 0, 1),
        ...          datetime(2019, 3, 1, 0, 0, 1, 0),
        ...          datetime(2019, 3, 1, 0, 2, 0, 0)]
        >>> cutoff_time = datetime(2019, 3, 1, 0, 0, 0, 0)
        >>> values = time_since_nano(times, time=cutoff_time)
        >>> list(map(lambda x: int(round(x)), values))
        [-1000, -1000000000, -120000000000]
    """

    name = "time_since"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_calc_time = True
    description_template = "the time from {} to the cutoff time"

    def __init__(self, unit="seconds"):
        self.unit = unit.lower()

    def get_function(self):
        def pd_time_since(array, time):
            return convert_time_units((time - array).dt.total_seconds(), self.unit)

        return pd_time_since


================================================
FILE: featuretools/primitives/standard/transform/datetime/time_since_previous.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base import TransformPrimitive
from featuretools.utils import convert_time_units


class TimeSincePrevious(TransformPrimitive):
    """Computes the time since the previous entry in a list.

    Args:
        unit (str): Defines the unit of time to count from.
            Defaults to seconds. Acceptable values:
            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds

    Description:
        Given a list of datetimes, compute the time elapsed since the
        previous item in the list, expressed in `unit` (seconds by default).
        The result for the first item in the list will always be `NaN`.

    Examples:
        >>> from datetime import datetime
        >>> time_since_previous = TimeSincePrevious()
        >>> dates = [datetime(2019, 3, 1, 0, 0, 0),
        ...          datetime(2019, 3, 1, 0, 2, 0),
        ...          datetime(2019, 3, 1, 0, 3, 0),
        ...          datetime(2019, 3, 1, 0, 2, 30),
        ...          datetime(2019, 3, 1, 0, 10, 0)]
        >>> time_since_previous(dates).tolist()
        [nan, 120.0, 60.0, -30.0, 450.0]
    """

    name = "time_since_previous"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the time since the previous instance of {}"

    def __init__(self, unit="seconds"):
        self.unit = unit.lower()

    def get_function(self):
        def pd_diff(values):
            return convert_time_units(
                values.diff().apply(lambda x: x.total_seconds()),
                self.unit,
            )

        return pd_diff


================================================
FILE: featuretools/primitives/standard/transform/datetime/utils.py
================================================
from typing import Optional, Tuple

import holidays
import pandas as pd


class HolidayUtil:
    def __init__(self, country="US"):
        try:
            country, subdivision = self.convert_to_subdivision(country)
            self.holidays = holidays.country_holidays(
                country=country,
                subdiv=subdivision,
            )
        except NotImplementedError:
            available_countries = (
                "https://github.com/dr-prodigy/python-holidays#available-countries"
            )
            error = "must be one of the available countries:\n%s" % available_countries
            raise ValueError(error)

        self.federal_holidays = getattr(holidays, country)(years=range(1950, 2075))

    def to_df(self):
        holidays_df = pd.DataFrame(
            sorted(self.federal_holidays.items()),
            columns=["holiday_date", "names"],
        )
        holidays_df.holiday_date = holidays_df.holiday_date.astype("datetime64[ns]")
        return holidays_df

    def convert_to_subdivision(self, country: str) -> Tuple[str, Optional[str]]:
        """Convert country to country + subdivision

           Created in response to library changes that changed countries to subdivisions

        Args:
            country (str): Original country name

        Returns:
            Tuple[str,Optional[str]]: country, subdivision
        """
        return {
            "ENGLAND": ("GB", country),
            "NORTHERNIRELAND": ("GB", country),
            "PORTUGALEXT": ("PT", "Ext"),
            "PTE": ("PT", "Ext"),
            "SCOTLAND": ("GB", country),
            "UK": ("GB", country),
            "WALES": ("GB", country),
        }.get(country.upper(), (country, None))


================================================
FILE: featuretools/primitives/standard/transform/datetime/week.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Week(TransformPrimitive):
    """Determines the week of the year from a datetime.

    Description:
        Returns the ISO 8601 week of the year from a datetime value. Weeks
        start on Monday, and week 1 is the week containing the first
        Thursday of the year, so dates in early January can fall in
        week 52 or 53 of the previous year.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 1, 3),
        ...          datetime(2019, 6, 17, 11, 10, 50),
        ...          datetime(2019, 11, 30, 19, 45, 15)]
        >>> week = Week()
        >>> week(dates).tolist()
        [1, 25, 48]
    """

    name = "week"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 54))),
        semantic_tags={"category"},
    )

    description_template = "the week of the year of {}"

    def get_function(self):
        def week(vals):
            if hasattr(vals.dt, "isocalendar"):
                return vals.dt.isocalendar().week
            else:
                return vals.dt.week

        return week


================================================
FILE: featuretools/primitives/standard/transform/datetime/weekday.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Weekday(TransformPrimitive):
    """Determines the day of the week from a datetime.

    Description:
        Returns the day of the week from a datetime value. Weeks
        start on Monday (day 0) and run through Sunday (day 6).

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2019, 6, 17, 11, 10, 50),
        ...          datetime(2019, 11, 30, 19, 45, 15)]
        >>> weekday = Weekday()
        >>> weekday(dates).tolist()
        [4, 0, 5]
    """

    name = "weekday"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(7))),
        semantic_tags={"category"},
    )

    description_template = "the day of the week of {}"

    def get_function(self):
        def weekday(vals):
            return vals.dt.weekday

        return weekday


================================================
FILE: featuretools/primitives/standard/transform/datetime/year.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Ordinal

from featuretools.primitives.base import TransformPrimitive


class Year(TransformPrimitive):
    """Determines the year value of a datetime.

    Examples:
        >>> from datetime import datetime
        >>> dates = [datetime(2019, 3, 1),
        ...          datetime(2048, 6, 17, 11, 10, 50),
        ...          datetime(1950, 11, 30, 19, 45, 15)]
        >>> year = Year()
        >>> year(dates).tolist()
        [2019, 2048, 1950]
    """

    name = "year"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(
        logical_type=Ordinal(order=list(range(1, 3000))),
        semantic_tags={"category"},
    )

    description_template = "the year of {}"

    def get_function(self):
        def year(vals):
            return vals.dt.year

        return year


================================================
FILE: featuretools/primitives/standard/transform/email/__init__.py
================================================
from featuretools.primitives.standard.transform.email.email_address_to_domain import (
    EmailAddressToDomain,
)
from featuretools.primitives.standard.transform.email.is_free_email_domain import (
    IsFreeEmailDomain,
)


================================================
FILE: featuretools/primitives/standard/transform/email/email_address_to_domain.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, EmailAddress

from featuretools.primitives.base import TransformPrimitive


class EmailAddressToDomain(TransformPrimitive):
    """Determines the domain of an email

    Description:
        EmailAddress input should be a string. Will return NaN
        if an invalid email address is provided, or if the input is
        not a string.

    Examples:
        >>> email_address_to_domain = EmailAddressToDomain()
        >>> email_address_to_domain(['name@gmail.com', 'name@featuretools.com']).tolist()
        ['gmail.com', 'featuretools.com']
    """

    name = "email_address_to_domain"
    input_types = [ColumnSchema(logical_type=EmailAddress)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        def email_address_to_domain(emails):
            # if the input is empty return an empty Series
            if len(emails) == 0:
                return pd.Series([], dtype="category")

            emails_df = pd.DataFrame({"email": emails})

            # if all emails are NaN, expand won't propagate NaNs and will fail on indexing
            if emails_df["email"].isnull().all():
                emails_df["domain"] = np.nan
                emails_df["domain"] = emails_df["domain"].astype(object)
            else:
                # .str.strip() and .str.split() return NaN for NaN values and propagate NaNs into new columns
                emails_df["domain"] = (
                    emails_df["email"].str.strip().str.split("@", expand=True)[1]
                )
            return emails_df.domain.values

        return email_address_to_domain


================================================
FILE: featuretools/primitives/standard/transform/email/is_free_email_domain.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, EmailAddress

from featuretools.primitives.base import TransformPrimitive


class IsFreeEmailDomain(TransformPrimitive):
    """Determines if an email address is from a free email domain.

    Description:
        EmailAddress input should be a string. Will return NaN
        if an invalid email address is provided, or if the input is
        not a string. The list of free email domains used in this primitive
        was obtained from https://github.com/willwhite/freemail/blob/master/data/free.txt.

    Examples:
        >>> is_free_email_domain = IsFreeEmailDomain()
        >>> is_free_email_domain(['name@gmail.com', 'name@featuretools.com']).tolist()
        [True, False]
    """

    name = "is_free_email_domain"
    input_types = [ColumnSchema(logical_type=EmailAddress)]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    filename = "free_email_provider_domains.txt"

    def get_function(self):
        file_path = self.get_filepath(self.filename)

        free_domains = pd.read_csv(file_path, header=None, names=["domain"])
        free_domains["domain"] = free_domains.domain.str.strip()

        def is_free_email_domain(emails):
            # if the input is empty return an empty Series
            if len(emails) == 0:
                return pd.Series([], dtype="category")

            emails_df = pd.DataFrame({"email": emails})

            # if all emails are NaN, expand won't propagate NaNs and will fail on indexing
            if emails_df["email"].isnull().all():
                emails_df["domain"] = np.nan
            else:
                # .str.strip() and .str.split() return NaN for NaN values and propagate NaNs into new columns
                emails_df["domain"] = (
                    emails_df["email"].str.strip().str.split("@", expand=True)[1]
                )

            emails_df["is_free"] = emails_df["domain"].isin(free_domains["domain"])

            # if there are any NaN domain values, change the series type to allow for
            # both bools and NaN values and set is_free to NaN for the NaN domains
            if emails_df["domain"].isnull().values.any():
                emails_df["is_free"] = emails_df["is_free"].astype("object")
                emails_df.loc[emails_df["domain"].isnull(), "is_free"] = np.nan
            return emails_df.is_free.values

        return is_free_email_domain


================================================
FILE: featuretools/primitives/standard/transform/exponential/__init__.py
================================================
from featuretools.primitives.standard.transform.exponential.exponential_weighted_average import (
    ExponentialWeightedAverage,
)
from featuretools.primitives.standard.transform.exponential.exponential_weighted_std import (
    ExponentialWeightedSTD,
)
from featuretools.primitives.standard.transform.exponential.exponential_weighted_variance import (
    ExponentialWeightedVariance,
)


================================================
FILE: featuretools/primitives/standard/transform/exponential/exponential_weighted_average.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class ExponentialWeightedAverage(TransformPrimitive):
    """Computes the exponentially weighted moving average for a series of numbers

    Description:
        Returns the exponentially weighted moving average for a series of
        numbers. Exactly one of center of mass (com), span, half-life, and
        alpha should be provided; if none is given, com defaults to 0.5.
        Missing values can be ignored when calculating weights by setting
        'ignore_na' to True.

    Args:
        com (float): Specify decay in terms of center of mass for com >= 0.
            Default is None.

        span (float): Specify decay in terms of span for span >= 1.
            Default is None.

        halflife (float): Specify decay in terms of half-life for halflife > 0.
            Default is None.

        alpha (float): Specify smoothing factor alpha directly. Alpha should be
            greater than 0 and less than or equal to 1. Default is None.

        ignore_na (bool): Ignore missing values when calculating weights.
            Default is False.

    Examples:
        >>> exponential_weighted_average = ExponentialWeightedAverage(com=0.5)
        >>> exponential_weighted_average([1, 2, 3, 4]).tolist()
        [1.0, 1.75, 2.615384615384615, 3.55]

        Missing values can be ignored

        >>> ewma_ignorena = ExponentialWeightedAverage(com=0.5, ignore_na=True)
        >>> ewma_ignorena([1, 2, 3, None, 4]).tolist()
        [1.0, 1.75, 2.615384615384615, 2.615384615384615, 3.55]
    """

    name = "exponential_weighted_average"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, com=None, span=None, halflife=None, alpha=None, ignore_na=False):
        if all(x is None for x in [com, span, halflife, alpha]):
            com = 0.5
        self.com = com
        self.span = span
        self.halflife = halflife
        self.alpha = alpha
        self.ignore_na = ignore_na

    def get_function(self):
        def exponential_weighted_average(x):
            return x.ewm(
                com=self.com,
                span=self.span,
                halflife=self.halflife,
                alpha=self.alpha,
                ignore_na=self.ignore_na,
            ).mean()

        return exponential_weighted_average


================================================
FILE: featuretools/primitives/standard/transform/exponential/exponential_weighted_std.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class ExponentialWeightedSTD(TransformPrimitive):
    """Computes the exponentially weighted moving standard deviation for
    a series of numbers

    Description:
        Returns the exponentially weighted moving standard deviation for a
        series of numbers. Exactly one of center of mass (com), span,
        half-life, and alpha should be provided; if none is given, com
        defaults to 0.5. Missing values can be ignored when calculating
        weights by setting 'ignore_na' to True.

    Args:
        com (float): Specify decay in terms of center of mass for com >= 0.
            Default is None.

        span (float): Specify decay in terms of span for span >= 1.
            Default is None.

        halflife (float): Specify decay in terms of half-life for halflife > 0.
            Default is None.

        alpha (float): Specify smoothing factor alpha directly. Alpha should be
            greater than 0 and less than or equal to 1. Default is None.

        ignore_na (bool): Ignore missing values when calculating weights.
            Default is False.

    Examples:
        >>> exponential_weighted_std = ExponentialWeightedSTD(com=0.5)
        >>> exponential_weighted_std([1, 2, 3, 7]).tolist()
        [nan, 0.7071067811865475, 0.9198662110077998, 2.9852200022005855]

        Missing values can be ignored

        >>> ewmstd_ignorena = ExponentialWeightedSTD(com=0.5, ignore_na=True)
        >>> ewmstd_ignorena([1, 2, 3, None, 7]).tolist()
        [nan, 0.7071067811865475, 0.9198662110077998, 0.9198662110077998, 2.9852200022005855]
    """

    name = "exponential_weighted_std"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, com=None, span=None, halflife=None, alpha=None, ignore_na=False):
        if all(x is None for x in [com, span, halflife, alpha]):
            com = 0.5
        self.com = com
        self.span = span
        self.halflife = halflife
        self.alpha = alpha
        self.ignore_na = ignore_na

    def get_function(self):
        def exponential_weighted_std(x):
            return x.ewm(
                com=self.com,
                span=self.span,
                halflife=self.halflife,
                alpha=self.alpha,
                ignore_na=self.ignore_na,
            ).std()

        return exponential_weighted_std


================================================
FILE: featuretools/primitives/standard/transform/exponential/exponential_weighted_variance.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class ExponentialWeightedVariance(TransformPrimitive):
    """Computes the exponentially weighted moving variance for a series of numbers

    Description:
        Returns the exponentially weighted moving variance for a series of
        numbers. Exactly one of center of mass (com), span, half-life, and
        alpha should be provided; if none is given, com defaults to 0.5.
        Missing values can be ignored when calculating weights by setting
        'ignore_na' to True.

    Args:
        com (float): Specify decay in terms of center of mass for com >= 0.
            Default is None.

        span (float): Specify decay in terms of span for span >= 1.
            Default is None.

        halflife (float): Specify decay in terms of half-life for halflife > 0.
            Default is None.

        alpha (float): Specify smoothing factor alpha directly. Alpha should be
            greater than 0 and less than or equal to 1. Default is None.

        ignore_na (bool): Ignore missing values when calculating weights.
            Default is False.

    Examples:
        >>> exponential_weighted_variance = ExponentialWeightedVariance(com=0.5)
        >>> exponential_weighted_variance([1, 2, 3, 4]).tolist()
        [nan, 0.49999999999999983, 0.8461538461538459, 1.1230769230769233]

        Missing values can be ignored

        >>> ewmv_ignorena = ExponentialWeightedVariance(com=0.5, ignore_na=True)
        >>> ewmv_ignorena([1, 2, 3, None, 4]).tolist()
        [nan, 0.49999999999999983, 0.8461538461538459, 0.8461538461538459, 1.1230769230769233]
    """

    name = "exponential_weighted_variance"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, com=None, span=None, halflife=None, alpha=None, ignore_na=False):
        if all(x is None for x in [com, span, halflife, alpha]):
            com = 0.5
        self.com = com
        self.span = span
        self.halflife = halflife
        self.alpha = alpha
        self.ignore_na = ignore_na

    def get_function(self):
        def exponential_weighted_variance(x):
            return x.ewm(
                com=self.com,
                span=self.span,
                halflife=self.halflife,
                alpha=self.alpha,
                ignore_na=self.ignore_na,
            ).var()

        return exponential_weighted_variance


================================================
FILE: featuretools/primitives/standard/transform/file_extension.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Filepath

from featuretools.primitives.base import TransformPrimitive


class FileExtension(TransformPrimitive):
    """Determines the extension of a filepath.

    Description:
        Given a list of filepaths, return the extension
        suffix of each one. If the filepath is missing
        or invalid, return `NaN`.

    Examples:
        >>> file_extension = FileExtension()
        >>> file_extension(['doc.txt', '~/documents/data.json', 'file']).tolist()
        ['.txt', '.json', nan]
    """

    name = "file_extension"
    input_types = [ColumnSchema(logical_type=Filepath)]
    return_type = ColumnSchema(semantic_tags={"category"})

    def get_function(self):
        def file_extension(x):
            # match a trailing dot followed by ASCII letters only
            p = r"(\.[a-zA-Z]+$)"
            return x.str.extract(p, expand=False).str.lower()

        return file_extension


================================================
FILE: featuretools/primitives/standard/transform/full_name_to_first_name.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, PersonFullName

from featuretools.primitives.base import TransformPrimitive


class FullNameToFirstName(TransformPrimitive):
    """Determines the first name from a person's name.

    Description:
        Given a list of names, determines the first name. If
        only a single name is provided, assume this is a first name.
        If only a title and a single name are provided, return `nan`.
        This assumes all titles will be followed by a period. Please note,
        in the current implementation, last names containing spaces may
        result in improper first name matches.


    Examples:
        >>> full_name_to_first_name = FullNameToFirstName()
        >>> names = ['Woolf Spector', 'Oliva y Ocana, Dona. Fermina',
        ...          'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. Brown']
        >>> full_name_to_first_name(names).to_list()
        ['Woolf', 'Oliva', 'Frederick', 'Michael', nan]
    """

    name = "full_name_to_first_name"
    input_types = [ColumnSchema(logical_type=PersonFullName)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        def full_name_to_first_name(x):
            title_with_last_pattern = r"(^[A-Z][a-z]+\. [A-Z][a-z]+$)"
            titles_pattern = r"([A-Z][a-z]+)\. "
            df = pd.DataFrame({"names": x})
            # remove any entries with just a title and a name
            df["names"] = df["names"].str.replace(
                title_with_last_pattern,
                "",
                regex=True,
            )
            # remove any known titles
            df["names"] = df["names"].str.replace(titles_pattern, "", regex=True)
            # extract first names
            pattern = r"([A-Z][a-z]+ |, [A-Z][a-z]+$|^[A-Z][a-z]+$)"
            df["first_name"] = df["names"].str.extract(pattern)
            # clean up white space and leftover commas
            df["first_name"] = df["first_name"].str.replace(",", "").str.strip()
            return df["first_name"]

        return full_name_to_first_name


================================================
FILE: featuretools/primitives/standard/transform/full_name_to_last_name.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, PersonFullName

from featuretools.primitives.base import TransformPrimitive


class FullNameToLastName(TransformPrimitive):
    """Determines the last name from a person's name.

    Description:
        Given a list of names, determines the last name. If
        only a single name is provided, assume this is a first name, and
        return `nan`. This assumes all titles will be followed by a period.


    Examples:
        >>> full_name_to_last_name = FullNameToLastName()
        >>> names = ['Woolf Spector', 'Oliva y Ocana, Dona. Fermina',
        ...          'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. Brown']
        >>> full_name_to_last_name(names).to_list()
        ['Spector', 'Oliva y Ocana', 'Ware', 'Peter', 'Brown']
    """

    name = "full_name_to_last_name"
    input_types = [ColumnSchema(logical_type=PersonFullName)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        def full_name_to_last_name(x):
            titles_pattern = r"([A-Z][a-z]+)\. "
            df = pd.DataFrame({"names": x})
            # extract the raw last-name match (may include a title or trailing comma)
            pattern = r"(^.+?,|^[A-Z][a-z]+\. [A-Z][a-z]+$| [A-Z][a-z]+$| [A-Z][a-z]+[/-][A-Z][a-z]+$)"
            df["last_name"] = df["names"].str.extract(pattern)
            # remove titles
            df["last_name"] = df["last_name"].str.replace(
                titles_pattern,
                "",
                regex=True,
            )
            # clean up white space and leftover commas
            df["last_name"] = df["last_name"].str.replace(",", "").str.strip()
            return df["last_name"]

        return full_name_to_last_name


================================================
FILE: featuretools/primitives/standard/transform/full_name_to_title.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, PersonFullName

from featuretools.primitives.base import TransformPrimitive


class FullNameToTitle(TransformPrimitive):
    """Determines the title from a person's name.

    Description:
        Given a list of names, determines the title, or
        prefix of each name (e.g. "Mr", "Mrs", etc). If
        no title is found, returns `NaN`.

    Examples:
        >>> full_name_to_title = FullNameToTitle()
        >>> names = ['Spector, Mr. Woolf', 'Oliva y Ocana, Dona. Fermina',
        ...          'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. Brown']
        >>> full_name_to_title(names).to_list()
        ['Mr', 'Dona', 'Mr', nan, 'Mr']
    """

    name = "full_name_to_title"
    input_types = [ColumnSchema(logical_type=PersonFullName)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        def full_name_to_title(x):
            pattern = r"([A-Z][a-z]+)\. "
            return x.str.extract(pattern, expand=True)[0]

        return full_name_to_title


================================================
FILE: featuretools/primitives/standard/transform/is_in.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean

from featuretools.primitives.base import TransformPrimitive


class IsIn(TransformPrimitive):
    """Determines whether a value is present in a provided list.

    Examples:
        >>> items = ['string', 10.3, False]
        >>> is_in = IsIn(list_of_outputs=items)
        >>> is_in(['string', 10.5, False]).tolist()
        [True, False, True]
    """

    name = "isin"
    input_types = [ColumnSchema()]
    return_type = ColumnSchema(logical_type=Boolean)

    def __init__(self, list_of_outputs=None):
        self.list_of_outputs = list_of_outputs
        if not list_of_outputs:
            stringified_output_list = "[]"
        else:
            stringified_output_list = ", ".join([str(x) for x in list_of_outputs])
        self.description_template = "whether {{}} is in {}".format(
            stringified_output_list,
        )

    def get_function(self):
        def pd_is_in(array):
            return array.isin(self.list_of_outputs or [])

        return pd_is_in

    def generate_name(self, base_feature_names):
        return "%s.isin(%s)" % (base_feature_names[0], str(self.list_of_outputs))


================================================
FILE: featuretools/primitives/standard/transform/is_null.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean

from featuretools.primitives.base import TransformPrimitive


class IsNull(TransformPrimitive):
    """Determines if a value is null.

    Examples:
        >>> is_null = IsNull()
        >>> is_null([1, None, 3]).tolist()
        [False, True, False]
    """

    name = "is_null"
    input_types = [ColumnSchema()]
    return_type = ColumnSchema(logical_type=Boolean)
    description_template = "whether {} is null"

    def get_function(self):
        def isnull(array):
            return array.isnull()

        return isnull


================================================
FILE: featuretools/primitives/standard/transform/latlong/__init__.py
================================================
from featuretools.primitives.standard.transform.latlong.cityblock_distance import (
    CityblockDistance,
)
from featuretools.primitives.standard.transform.latlong.geomidpoint import GeoMidpoint
from featuretools.primitives.standard.transform.latlong.haversine import Haversine
from featuretools.primitives.standard.transform.latlong.is_in_geobox import IsInGeoBox
from featuretools.primitives.standard.transform.latlong.latitude import Latitude
from featuretools.primitives.standard.transform.latlong.longitude import Longitude


================================================
FILE: featuretools/primitives/standard/transform/latlong/cityblock_distance.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double, LatLong

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.latlong.utils import (
    _haversine_calculate,
)


class CityblockDistance(TransformPrimitive):
    """Calculates the distance between points in a city road grid.

    Description:
        This distance is calculated using the haversine formula, which
        takes into account the curvature of the Earth.
        If either input contains `NaN`s, the calculated
        distance will be `NaN`.
        This calculation is also known as the Manhattan distance.

    Args:
        unit (str): Determines the unit value to output. Could
            be miles or kilometers. Default is miles.

    Examples:
        >>> cityblock_distance = CityblockDistance()
        >>> DC = (38, -77)
        >>> Boston = (43, -71)
        >>> NYC = (40, -74)
        >>> distances_mi = cityblock_distance([DC, DC], [NYC, Boston])
        >>> np.round(distances_mi, 3).tolist()
        [301.519, 672.089]

        We can also change the units in which the distance is calculated.

        >>> cityblock_distance_kilometers = CityblockDistance(unit='kilometers')
        >>> distances_km = cityblock_distance_kilometers([DC, DC], [NYC, Boston])
        >>> np.round(distances_km, 3).tolist()
        [485.248, 1081.622]
    """

    name = "cityblock_distance"
    input_types = [
        ColumnSchema(logical_type=LatLong),
        ColumnSchema(logical_type=LatLong),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    commutative = True

    def __init__(self, unit="miles"):
        if unit not in ["miles", "kilometers"]:
            raise ValueError("Invalid unit given")
        self.unit = unit

    def get_function(self):
        def cityblock(latlong_1, latlong_2):
            latlong_1 = np.array(latlong_1.tolist())
            latlong_2 = np.array(latlong_2.tolist())
            lat_1s = latlong_1[:, 0]
            lat_2s = latlong_2[:, 0]
            lon_1s = latlong_1[:, 1]
            lon_2s = latlong_2[:, 1]
            lon_dist = _haversine_calculate(lat_1s, lon_1s, lat_1s, lon_2s, self.unit)
            lat_dist = _haversine_calculate(lat_1s, lon_1s, lat_2s, lon_1s, self.unit)
            return pd.Series(lon_dist + lat_dist)

        return cityblock


================================================
FILE: featuretools/primitives/standard/transform/latlong/geomidpoint.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import LatLong

from featuretools.primitives.base import TransformPrimitive


class GeoMidpoint(TransformPrimitive):
    """Determines the geographic center of two coordinates.

    Examples:
        >>> geomidpoint = GeoMidpoint()
        >>> geomidpoint([(42.4, -71.1)], [(40.0, -122.4)])
        [(41.2, -96.75)]
    """

    name = "geomidpoint"
    input_types = [
        ColumnSchema(logical_type=LatLong),
        ColumnSchema(logical_type=LatLong),
    ]
    return_type = ColumnSchema(logical_type=LatLong)
    commutative = True

    def get_function(self):
        def geomidpoint_func(latlong_1, latlong_2):
            latlong_1 = np.array(latlong_1.tolist())
            latlong_2 = np.array(latlong_2.tolist())
            lat_1s = latlong_1[:, 0]
            lat_2s = latlong_2[:, 0]
            lon_1s = latlong_1[:, 1]
            lon_2s = latlong_2[:, 1]

            lat_middle = np.array([lat_1s, lat_2s]).transpose().mean(axis=1)
            lon_middle = np.array([lon_1s, lon_2s]).transpose().mean(axis=1)
            return list(zip(lat_middle, lon_middle))

        return geomidpoint_func


================================================
FILE: featuretools/primitives/standard/transform/latlong/haversine.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import LatLong

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.latlong.utils import (
    _haversine_calculate,
)


class Haversine(TransformPrimitive):
    """Calculates the approximate haversine distance between two LatLong columns.

    Args:
        unit (str): Determines the unit value to output. Could
            be `miles` or `kilometers`. Default is `miles`.

    Examples:
        >>> haversine = Haversine()
        >>> distances = haversine([(42.4, -71.1), (40.0, -122.4)],
        ...                       [(40.0, -122.4), (41.2, -96.75)])
        >>> np.round(distances, 3).tolist()
        [2631.231, 1343.289]

        Output units can be specified

        >>> haversine_km = Haversine(unit='kilometers')
        >>> distances_km = haversine_km([(42.4, -71.1), (40.0, -122.4)],
        ...                             [(40.0, -122.4), (41.2, -96.75)])
        >>> np.round(distances_km, 3).tolist()
        [4234.555, 2161.814]
    """

    name = "haversine"
    input_types = [
        ColumnSchema(logical_type=LatLong),
        ColumnSchema(logical_type=LatLong),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    commutative = True

    def __init__(self, unit="miles"):
        valid_units = ["miles", "kilometers"]
        if unit not in valid_units:
            error_message = "Invalid unit %s provided. Must be one of %s" % (
                unit,
                valid_units,
            )
            raise ValueError(error_message)
        self.unit = unit
        self.description_template = (
            "the haversine distance in {} between {{}} and {{}}".format(self.unit)
        )

    def get_function(self):
        def haversine(latlong_1, latlong_2):
            latlong_1 = np.array(latlong_1.tolist())
            latlong_2 = np.array(latlong_2.tolist())
            lat_1s = latlong_1[:, 0]
            lat_2s = latlong_2[:, 0]
            lon_1s = latlong_1[:, 1]
            lon_2s = latlong_2[:, 1]

            distance = _haversine_calculate(lat_1s, lon_1s, lat_2s, lon_2s, self.unit)
            return distance

        return haversine

    def generate_name(self, base_feature_names):
        name = "{}(".format(self.name.upper())
        name += ", ".join(base_feature_names)
        if self.unit != "miles":
            name += ", unit={}".format(self.unit)
        name += ")"
        return name


================================================
FILE: featuretools/primitives/standard/transform/latlong/is_in_geobox.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable, LatLong

from featuretools.primitives.base import TransformPrimitive


class IsInGeoBox(TransformPrimitive):
    """Determines if coordinates are inside a box defined by two
    corner coordinate points.

    Description:
        Coordinate values should be specified as (latitude, longitude)
        tuples. This primitive is unable to handle coordinates and boxes
        at the poles or near +/- 180 degrees longitude.

    Args:
        point1 (tuple(float, float)): The coordinates
            of the first corner of the box. Defaults to (0, 0).
        point2 (tuple(float, float)): The coordinates
            of the diagonal corner of the box. Defaults to (0, 0).

    Example:
        >>> is_in_geobox = IsInGeoBox((40.7128, -74.0060), (42.2436, -71.1677))
        >>> is_in_geobox([(41.034, -72.254), (39.125, -87.345)]).tolist()
        [True, False]
    """

    name = "is_in_geobox"
    input_types = [ColumnSchema(logical_type=LatLong)]
    return_type = ColumnSchema(logical_type=BooleanNullable)

    def __init__(self, point1=(0, 0), point2=(0, 0)):
        self.point1 = point1
        self.point2 = point2
        self.lats = np.sort(np.array([point1[0], point2[0]]))
        self.lons = np.sort(np.array([point1[1], point2[1]]))

    def get_function(self):
        def geobox(latlongs):
            transposed = np.transpose(np.array(latlongs.tolist()))
            lats = (self.lats[0] <= transposed[0]) & (self.lats[1] >= transposed[0])
            longs = (self.lons[0] <= transposed[1]) & (self.lons[1] >= transposed[1])
            return lats & longs

        return geobox
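
# Illustrative sketch (not part of featuretools): the same bounding-box test
# written for a single point with only the standard library. The helper name
# `in_geobox` is hypothetical. Each axis is sorted first, so the two corners
# can be given in either order. As in the primitive above, a box that crosses
# +/- 180 degrees longitude is not handled, because sorting makes the box span
# the long way around.

```python
def in_geobox(point, corner1, corner2):
    """Return True if the (lat, lon) point lies inside the box spanned by two corners."""
    # Sort each axis so that the corner order does not matter.
    lat_lo, lat_hi = sorted((corner1[0], corner2[0]))
    lon_lo, lon_hi = sorted((corner1[1], corner2[1]))
    lat, lon = point
    return lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi


# Same corners and points as the IsInGeoBox docstring example.
box = ((40.7128, -74.0060), (42.2436, -71.1677))
inside = in_geobox((41.034, -72.254), *box)
outside = in_geobox((39.125, -87.345), *box)
# Swapping the corners gives the same answer.
swapped = in_geobox((41.034, -72.254), box[1], box[0])
```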


================================================
FILE: featuretools/primitives/standard/transform/latlong/latitude.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import LatLong

from featuretools.primitives.base import TransformPrimitive


class Latitude(TransformPrimitive):
    """Returns the first tuple value (the latitude) in a list of LatLong tuples.

    For use with the LatLong logical type.

    Examples:
        >>> latitude = Latitude()
        >>> latitude([(42.4, -71.1),
        ...           (40.0, -122.4),
        ...           (41.2, -96.75)]).tolist()
        [42.4, 40.0, 41.2]
    """

    name = "latitude"
    input_types = [ColumnSchema(logical_type=LatLong)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the latitude of {}"

    def get_function(self):
        def latitude(latlong):
            latlong = np.array(latlong.tolist())
            return latlong[:, 0]

        return latitude


================================================
FILE: featuretools/primitives/standard/transform/latlong/longitude.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import LatLong

from featuretools.primitives.base import TransformPrimitive


class Longitude(TransformPrimitive):
    """Returns the second tuple value (the longitude) in a list of LatLong tuples.

    For use with the LatLong logical type.

    Examples:
        >>> longitude = Longitude()
        >>> longitude([(42.4, -71.1),
        ...            (40.0, -122.4),
        ...            (41.2, -96.75)]).tolist()
        [-71.1, -122.4, -96.75]
    """

    name = "longitude"
    input_types = [ColumnSchema(logical_type=LatLong)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the longitude of {}"

    def get_function(self):
        def longitude(latlong):
            latlong = np.array(latlong.tolist())
            return latlong[:, 1]

        return longitude


================================================
FILE: featuretools/primitives/standard/transform/latlong/utils.py
================================================
import numpy as np


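def _haversine_calculate(lat_1s, lon_1s, lat_2s, lon_2s, unit):
    """Return great-circle distances between paired coordinates given in degrees.

    Distances are in miles unless ``unit`` is "kilometers".
    """
    # https://stackoverflow.com/a/29546836/2512385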
    lon1, lat1, lon2, lat2 = map(np.radians, [lon_1s, lat_1s, lon_2s, lat_2s])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    # Mean Earth radius in miles; replaced with kilometers below when requested.
    radius_earth = 3958.7613
    if unit == "kilometers":
        radius_earth = 6371.0088
    distances = radius_earth * 2 * np.arcsin(np.sqrt(a))
    return distances
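
# Illustrative sketch (not part of featuretools): the haversine formula above,
# rewritten for a single pair of points with the standard `math` module in
# place of NumPy. The names `EARTH_RADIUS` and `haversine_scalar` are
# hypothetical. The radii are the two constants used in _haversine_calculate.

```python
import math

# Earth radius per output unit, copied from _haversine_calculate.
EARTH_RADIUS = {"miles": 3958.7613, "kilometers": 6371.0088}


def haversine_scalar(lat1, lon1, lat2, lon2, unit="miles"):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (
        math.sin((lat2 - lat1) / 2.0) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2
    )
    return EARTH_RADIUS[unit] * 2 * math.asin(math.sqrt(a))


# Two points on the equator 90 degrees apart are a quarter circumference apart.
quarter_km = haversine_scalar(0.0, 0.0, 0.0, 90.0, unit="kilometers")
# First pair from the Haversine docstring example (default unit is miles).
docstring_pair_miles = haversine_scalar(42.4, -71.1, 40.0, -122.4)
# Identical points are at distance zero.
same_point = haversine_scalar(10.0, 20.0, 10.0, 20.0)
```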


================================================
FILE: featuretools/primitives/standard/transform/natural_language/__init__.py
================================================
from featuretools.primitives.standard.transform.natural_language.count_string import (
    CountString,
)
from featuretools.primitives.standard.transform.natural_language.mean_characters_per_word import (
    MeanCharactersPerWord,
)
from featuretools.primitives.standard.transform.natural_language.median_word_length import (
    MedianWordLength,
)
from featuretools.primitives.standard.transform.natural_language.num_characters import (
    NumCharacters,
)
from featuretools.primitives.standard.transform.natural_language.num_unique_separators import (
    NumUniqueSeparators,
)
from featuretools.primitives.standard.transform.natural_language.num_words import (
    NumWords,
)
from featuretools.primitives.standard.transform.natural_language.number_of_common_words import (
    NumberOfCommonWords,
)
from featuretools.primitives.standard.transform.natural_language.number_of_hashtags import (
    NumberOfHashtags,
)
from featuretools.primitives.standard.transform.natural_language.number_of_mentions import (
    NumberOfMentions,
)
from featuretools.primitives.standard.transform.natural_language.number_of_unique_words import (
    NumberOfUniqueWords,
)
from featuretools.primitives.standard.transform.natural_language.number_of_words_in_quotes import (
    NumberOfWordsInQuotes,
)
from featuretools.primitives.standard.transform.natural_language.punctuation_count import (
    PunctuationCount,
)
from featuretools.primitives.standard.transform.natural_language.title_word_count import (
    TitleWordCount,
)
from featuretools.primitives.standard.transform.natural_language.total_word_length import (
    TotalWordLength,
)
from featuretools.primitives.standard.transform.natural_language.upper_case_count import (
    UpperCaseCount,
)
from featuretools.primitives.standard.transform.natural_language.upper_case_word_count import (
    UpperCaseWordCount,
)
from featuretools.primitives.standard.transform.natural_language.whitespace_count import (
    WhitespaceCount,
)


================================================
FILE: featuretools/primitives/standard/transform/natural_language/constants.py
================================================
from string import punctuation

DELIMITERS = "[ \n\t]"
PUNCTUATION_AND_WHITESPACE = f"[{punctuation}\n\t ]"

common_words_1000 = frozenset(
    [
        "the",
        "of",
        "to",
        "and",
        "a",
        "in",
        "is",
        "it",
        "you",
        "that",
        "he",
        "was",
        "for",
        "on",
        "are",
        "with",
        "as",
        "i",
        "his",
        "they",
        "be",
        "at",
        "one",
        "have",
        "this",
        "from",
        "or",
        "had",
        "by",
        "not",
        "word",
        "but",
        "what",
        "some",
        "we",
        "can",
        "out",
        "other",
        "were",
        "all",
        "there",
        "when",
        "up",
        "use",
        "your",
        "how",
        "said",
        "an",
        "each",
        "she",
        "which",
        "do",
        "their",
        "time",
        "if",
        "will",
        "way",
        "about",
        "many",
        "then",
        "them",
        "write",
        "would",
        "like",
        "so",
        "these",
        "her",
        "long",
        "make",
        "thing",
        "see",
        "him",
        "two",
        "has",
        "look",
        "more",
        "day",
        "could",
        "go",
        "come",
        "did",
        "number",
        "sound",
        "no",
        "most",
        "people",
        "my",
        "over",
        "know",
        "water",
        "than",
        "call",
        "first",
        "who",
        "may",
        "down",
        "side",
        "been",
        "now",
        "find",
        "any",
        "new",
        "work",
        "part",
        "take",
        "get",
        "place",
        "made",
        "live",
        "where",
        "after",
        "back",
        "little",
        "only",
        "round",
        "man",
        "year",
        "came",
        "show",
        "every",
        "good",
        "me",
        "give",
        "our",
        "under",
        "name",
        "very",
        "through",
        "just",
        "form",
        "sentence",
        "great",
        "think",
        "say",
        "help",
        "low",
        "line",
        "differ",
        "turn",
        "cause",
        "much",
        "mean",
        "before",
        "move",
        "right",
        "boy",
        "old",
        "too",
        "same",
        "tell",
        "does",
        "set",
        "three",
        "want",
        "air",
        "well",
        "also",
        "play",
        "small",
        "end",
        "put",
        "home",
        "read",
        "hand",
        "port",
        "large",
        "spell",
        "add",
        "even",
        "land",
        "here",
        "must",
        "big",
        "high",
        "such",
        "follow",
        "act",
        "why",
        "ask",
        "men",
        "change",
        "went",
        "light",
        "kind",
        "off",
        "need",
        "house",
        "picture",
        "try",
        "us",
        "again",
        "animal",
        "point",
        "mother",
        "world",
        "near",
        "build",
        "self",
        "earth",
        "father",
        "head",
        "stand",
        "own",
        "page",
        "should",
        "country",
        "found",
        "answer",
        "school",
        "grow",
        "study",
        "still",
        "learn",
        "plant",
        "cover",
        "food",
        "sun",
        "four",
        "between",
        "state",
        "keep",
        "eye",
        "never",
        "last",
        "let",
        "thought",
        "city",
        "tree",
        "cross",
        "farm",
        "hard",
        "start",
        "might",
        "story",
        "saw",
        "far",
        "sea",
        "draw",
        "left",
        "late",
        "run",
        "don't",
        "while",
        "press",
        "close",
        "night",
        "real",
        "life",
        "few",
        "north",
        "open",
        "seem",
        "together",
        "next",
        "white",
        "children",
        "begin",
        "got",
        "walk",
        "example",
        "ease",
        "paper",
        "group",
        "always",
        "music",
        "those",
        "both",
        "mark",
        "often",
        "letter",
        "until",
        "mile",
        "river",
        "car",
        "feet",
        "care",
        "second",
        "book",
        "carry",
        "took",
        "science",
        "eat",
        "room",
        "friend",
        "began",
        "idea",
        "fish",
        "mountain",
        "stop",
        "once",
        "base",
        "hear",
        "horse",
        "cut",
        "sure",
        "watch",
        "color",
        "face",
        "wood",
        "main",
        "enough",
        "plain",
        "girl",
        "usual",
        "young",
        "ready",
        "above",
        "ever",
        "red",
        "list",
        "though",
        "feel",
        "talk",
        "bird",
        "soon",
        "body",
        "dog",
        "family",
        "direct",
        "pose",
        "leave",
        "song",
        "measure",
        "door",
        "product",
        "black",
        "short",
        "numeral",
        "class",
        "wind",
        "question",
        "happen",
        "complete",
        "ship",
        "area",
        "half",
        "rock",
        "order",
        "fire",
        "south",
        "problem",
        "piece",
        "told",
        "knew",
        "pass",
        "since",
        "top",
        "whole",
        "king",
        "space",
        "heard",
        "best",
        "hour",
        "better",
        "true",
        "during",
        "hundred",
        "five",
        "remember",
        "step",
        "early",
        "hold",
        "west",
        "ground",
        "interest",
        "reach",
        "fast",
        "verb",
        "sing",
        "listen",
        "six",
        "table",
        "travel",
        "less",
        "morning",
        "ten",
        "simple",
        "several",
        "vowel",
        "toward",
        "war",
        "lay",
        "against",
        "pattern",
        "slow",
        "center",
        "love",
        "person",
        "money",
        "serve",
        "appear",
        "road",
        "map",
        "rain",
        "rule",
        "govern",
        "pull",
        "cold",
        "notice",
        "voice",
        "unit",
        "power",
        "town",
        "fine",
        "certain",
        "fly",
        "fall",
        "lead",
        "cry",
        "dark",
        "machine",
        "note",
        "wait",
        "plan",
        "figure",
        "star",
        "box",
        "noun",
        "field",
        "rest",
        "correct",
        "able",
        "pound",
        "done",
        "beauty",
        "drive",
        "stood",
        "contain",
        "front",
        "teach",
        "week",
        "final",
        "gave",
        "green",
        "oh",
        "quick",
        "develop",
        "ocean",
        "warm",
        "free",
        "minute",
        "strong",
        "special",
        "mind",
        "behind",
        "clear",
        "tail",
        "produce",
        "fact",
        "street",
        "inch",
        "multiply",
        "nothing",
        "course",
        "stay",
        "wheel",
        "full",
        "force",
        "blue",
        "object",
        "decide",
        "surface",
        "deep",
        "moon",
        "island",
        "foot",
        "system",
        "busy",
        "test",
        "record",
        "boat",
        "common",
        "gold",
        "possible",
        "plane",
        "stead",
        "dry",
        "wonder",
        "laugh",
        "thousand",
        "ago",
        "ran",
        "check",
        "game",
        "shape",
        "equate",
        "hot",
        "miss",
        "brought",
        "heat",
        "snow",
        "tire",
        "bring",
        "yes",
        "distant",
        "fill",
        "east",
        "paint",
        "language",
        "among",
        "grand",
        "ball",
        "yet",
        "wave",
        "drop",
        "heart",
        "am",
        "present",
        "heavy",
        "dance",
        "engine",
        "position",
        "arm",
        "wide",
        "sail",
        "material",
        "size",
        "vary",
        "settle",
        "speak",
        "weight",
        "general",
        "ice",
        "matter",
        "circle",
        "pair",
        "include",
        "divide",
        "syllable",
        "felt",
        "perhaps",
        "pick",
        "sudden",
        "count",
        "square",
        "reason",
        "length",
        "represent",
        "art",
        "subject",
        "region",
        "energy",
        "hunt",
        "probable",
        "bed",
        "brother",
        "egg",
        "ride",
        "cell",
        "believe",
        "fraction",
        "forest",
        "sit",
        "race",
        "window",
        "store",
        "summer",
        "train",
        "sleep",
        "prove",
        "lone",
        "leg",
        "exercise",
        "wall",
        "catch",
        "mount",
        "wish",
        "sky",
        "board",
        "joy",
        "winter",
        "sat",
        "written",
        "wild",
        "instrument",
        "kept",
        "glass",
        "grass",
        "cow",
        "job",
        "edge",
        "sign",
        "visit",
        "past",
        "soft",
        "fun",
        "bright",
        "gas",
        "weather",
        "month",
        "million",
        "bear",
        "finish",
        "happy",
        "hope",
        "flower",
        "clothe",
        "strange",
        "gone",
        "jump",
        "baby",
        "eight",
        "village",
        "meet",
        "root",
        "buy",
        "raise",
        "solve",
        "metal",
        "whether",
        "push",
        "seven",
        "paragraph",
        "third",
        "shall",
        "held",
        "hair",
        "describe",
        "cook",
        "floor",
        "either",
        "result",
        "burn",
        "hill",
        "safe",
        "cat",
        "century",
        "consider",
        "type",
        "law",
        "bit",
        "coast",
        "copy",
        "phrase",
        "silent",
        "tall",
        "sand",
        "soil",
        "roll",
        "temperature",
        "finger",
        "industry",
        "value",
        "fight",
        "lie",
        "beat",
        "excite",
        "natural",
        "view",
        "sense",
        "ear",
        "else",
        "quite",
        "broke",
        "case",
        "middle",
        "kill",
        "son",
        "lake",
        "moment",
        "scale",
        "loud",
        "spring",
        "observe",
        "child",
        "straight",
        "consonant",
        "nation",
        "dictionary",
        "milk",
        "speed",
        "method",
        "organ",
        "pay",
        "age",
        "section",
        "dress",
        "cloud",
        "surprise",
        "quiet",
        "stone",
        "tiny",
        "climb",
        "cool",
        "design",
        "poor",
        "lot",
        "experiment",
        "bottom",
        "key",
        "iron",
        "single",
        "stick",
        "flat",
        "twenty",
        "skin",
        "smile",
        "crease",
        "hole",
        "trade",
        "melody",
        "trip",
        "office",
        "receive",
        "row",
        "mouth",
        "exact",
        "symbol",
        "die",
        "least",
        "trouble",
        "shout",
        "except",
        "wrote",
        "seed",
        "tone",
        "join",
        "suggest",
        "clean",
        "break",
        "lady",
        "yard",
        "rise",
        "bad",
        "blow",
        "oil",
        "blood",
        "touch",
        "grew",
        "cent",
        "mix",
        "team",
        "wire",
        "cost",
        "lost",
        "brown",
        "wear",
        "garden",
        "equal",
        "sent",
        "choose",
        "fell",
        "fit",
        "flow",
        "fair",
        "bank",
        "collect",
        "save",
        "control",
        "decimal",
        "gentle",
        "woman",
        "captain",
        "practice",
        "separate",
        "difficult",
        "doctor",
        "please",
        "protect",
        "noon",
        "whose",
        "locate",
        "ring",
        "character",
        "insect",
        "caught",
        "period",
        "indicate",
        "radio",
        "spoke",
        "atom",
        "human",
        "history",
        "effect",
        "electric",
        "expect",
        "crop",
        "modern",
        "element",
        "hit",
        "student",
        "corner",
        "party",
        "supply",
        "bone",
        "rail",
        "imagine",
        "provide",
        "agree",
        "thus",
        "capital",
        "won't",
        "chair",
        "danger",
        "fruit",
        "rich",
        "thick",
        "soldier",
        "process",
        "operate",
        "guess",
        "necessary",
        "sharp",
        "wing",
        "create",
        "neighbor",
        "wash",
        "bat",
        "rather",
        "crowd",
        "corn",
        "compare",
        "poem",
        "string",
        "bell",
        "depend",
        "meat",
        "rub",
        "tube",
        "famous",
        "dollar",
        "stream",
        "fear",
        "sight",
        "thin",
        "triangle",
        "planet",
        "hurry",
        "chief",
        "colony",
        "clock",
        "mine",
        "tie",
        "enter",
        "major",
        "fresh",
        "search",
        "send",
        "yellow",
        "gun",
        "allow",
        "print",
        "dead",
        "spot",
        "desert",
        "suit",
        "current",
        "lift",
        "rose",
        "continue",
        "block",
        "chart",
        "hat",
        "sell",
        "success",
        "company",
        "subtract",
        "event",
        "particular",
        "deal",
        "swim",
        "term",
        "opposite",
        "wife",
        "shoe",
        "shoulder",
        "spread",
        "arrange",
        "camp",
        "invent",
        "cotton",
        "born",
        "determine",
        "quart",
        "nine",
        "truck",
        "noise",
        "level",
        "chance",
        "gather",
        "shop",
        "stretch",
        "throw",
        "shine",
        "property",
        "column",
        "molecule",
        "select",
        "wrong",
        "gray",
        "repeat",
        "require",
        "broad",
        "prepare",
        "salt",
        "nose",
        "plural",
        "anger",
        "claim",
        "continent",
        "oxygen",
        "sugar",
        "death",
        "pretty",
        "skill",
        "women",
        "season",
        "solution",
        "magnet",
        "silver",
        "thank",
        "branch",
        "match",
        "suffix",
        "especially",
        "fig",
        "afraid",
        "huge",
        "sister",
        "steel",
        "discuss",
        "forward",
        "similar",
        "guide",
        "experience",
        "score",
        "apple",
        "bought",
        "led",
        "pitch",
        "coat",
        "mass",
        "card",
        "band",
        "rope",
        "slip",
        "win",
        "dream",
        "evening",
        "condition",
        "feed",
        "tool",
        "total",
        "basic",
        "smell",
        "valley",
        "nor",
        "double",
        "seat",
        "arrive",
        "master",
        "track",
        "parent",
        "shore",
        "division",
        "sheet",
        "substance",
        "favor",
        "connect",
        "post",
        "spend",
        "chord",
        "fat",
        "glad",
        "original",
        "share",
        "station",
        "dad",
        "bread",
        "charge",
        "proper",
        "bar",
        "offer",
        "segment",
        "slave",
        "duck",
        "instant",
        "market",
        "degree",
        "populate",
        "chick",
        "dear",
        "enemy",
        "reply",
        "drink",
        "occur",
        "support",
        "speech",
        "nature",
        "range",
        "steam",
        "motion",
        "path",
        "liquid",
        "log",
        "meant",
        "quotient",
        "teeth",
        "shell",
        "neck",
    ],
)  # https://gist.github.com/deekayen/4148741


================================================
FILE: featuretools/primitives/standard/transform/natural_language/count_string.py
================================================
import re

import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive


class CountString(TransformPrimitive):
    """Determines how many times a given string shows up in a text field.

    Args:
        string (str): The string to determine the count of. Defaults to
            the word "the".
        ignore_case (bool): Determines if case of the string should be
            ignored. Defaults to True.
        ignore_non_alphanumeric (bool): Determines if non-alphanumeric
            characters should be ignored in the search. Defaults to False.
        is_regex (bool): Defines if the string argument is a regex or not.
            Defaults to False.
        match_whole_words_only (bool): Determines if whole words should be
            matched or not. For example, searching for the word `the` in
            `then, the, there` matches only `the` when this argument
            is True. Defaults to False.

    Examples:
        >>> count_string = CountString(string="the")
        >>> count_string(["The problem was difficult.",
        ...               "He was there.",
        ...               "The girl went to the store."]).tolist()
        [1.0, 1.0, 2.0]
        >>> # Match case of string
        >>> count_string_ignore_case = CountString(string="the", ignore_case=False)
        >>> count_string_ignore_case(["The problem was difficult.",
        ...                           "He was there.",
        ...                           "The girl went to the store."]).tolist()
        [0.0, 1.0, 1.0]
        >>> # Ignore non-alphanumeric characters in the search
        >>> count_string_ignore_non_alphanumeric = CountString(string="the",
        ...                                                    ignore_non_alphanumeric=True)
        >>> count_string_ignore_non_alphanumeric(["Th*/e problem was difficult.",
        ...                                       "He was there.",
        ...                                       "The girl went to the store."]).tolist()
        [1.0, 1.0, 2.0]
        >>> # Specify the string as a regex
        >>> count_string_is_regex = CountString(string="t.e", is_regex=True)
        >>> count_string_is_regex(["The problem was difficult.",
        ...                        "He was there.",
        ...                        "The girl went to the store."]).tolist()
        [1.0, 1.0, 2.0]
        >>> # Match whole words only
        >>> count_string_match_whole_words_only = CountString(string="the",
        ...                                                   match_whole_words_only=True)
        >>> count_string_match_whole_words_only(["The problem was difficult.",
        ...                                      "He was there.",
        ...                                      "The girl went to the store."]).tolist()
        [1.0, 0.0, 2.0]
    """

    name = "count_string"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})

    def __init__(
        self,
        string="the",
        ignore_case=True,
        ignore_non_alphanumeric=False,
        is_regex=False,
        match_whole_words_only=False,
    ):
        self.string = string
        self.ignore_case = ignore_case
        self.ignore_non_alphanumeric = ignore_non_alphanumeric
        self.match_whole_words_only = match_whole_words_only
        self.is_regex = is_regex

        # When the string is a regex, don't strip non-alphanumeric characters
        # from the pattern, i.e. h.ll. should match "hello", so the dots can't
        # be stripped to make hll
        if not is_regex:
            self.pattern = re.escape(self.process_text(string))
        else:
            self.pattern = string
            if ignore_case:
                self.pattern = self.pattern.lower()

        # \b\b.*\b\b is the same as \b.*\b so we don't have to check if
        # the pattern is given to us as regex and if it already has leading
        # and trailing \b's
        if match_whole_words_only:
            self.pattern = "\\b" + self.pattern + "\\b"

    def process_text(self, text):
        if self.ignore_non_alphanumeric:
            text = re.sub("[^0-9a-zA-Z ]+", "", text)
        if self.ignore_case:
            text = text.lower()
        return text

    def get_function(self):
        def count_string(words):
            if not isinstance(words, str):
                return np.nan
            words = self.process_text(words)
            return len(re.findall(self.pattern, words))

        return np.vectorize(count_string, otypes=[float])
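
# Illustrative sketch (not part of featuretools): how the pattern built in
# __init__ behaves, shown with the standard `re` module only. It uses the
# `then, the, there` text from the docstring to contrast substring matching
# with whole-word matching via \b, and shows why a non-regex string is passed
# through re.escape. The variable names are hypothetical.

```python
import re

text = "then, the, there"

# Plain substring search: "the" occurs inside all three words.
substring_hits = len(re.findall(re.escape("the"), text))
# Wrapping the pattern in \b...\b keeps only the standalone word.
whole_word_hits = len(re.findall(r"\b" + re.escape("the") + r"\b", text))

# Escaping turns "." into a literal dot; unescaped, it matches any character.
literal_dot_hits = len(re.findall(re.escape("t.e"), "the t.e"))
regex_dot_hits = len(re.findall("t.e", "the t.e"))
```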


================================================
FILE: featuretools/primitives/standard/transform/natural_language/mean_characters_per_word.py
================================================
# -*- coding: utf-8 -*-

import re

import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive

PUNCTUATION = re.escape("!,.:;?")
END_OF_SENTENCE_PUNCT_RE = re.compile(
    rf"[{PUNCTUATION}]+$|[{PUNCTUATION}]+ |[{PUNCTUATION}]+\n",
)


def _mean_characters_per_word(value):
    if pd.isna(value):
        return np.nan

    # replace end-of-sentence punctuation with space
    value = END_OF_SENTENCE_PUNCT_RE.sub(" ", value)
    words = value.split()
    character_count = [len(x) for x in words]

    return np.mean(character_count) if len(character_count) else 0


class MeanCharactersPerWord(TransformPrimitive):
    """Determines the mean number of characters per word.

    Description:
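        Given a list of strings, determine the mean number of
        characters per word in each string. A word is defined as
        a series of any characters not separated by white space.
        End-of-sentence punctuation (any of `!,.:;?` followed by a
        space, a newline, or the end of the string) is removed before
        counting. If a string is `NaN`, return `NaN`; if it is empty
        or contains no words, return 0.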

    Examples:
        >>> x = ['This is a test file', 'This is second line', 'third line $1,000']
        >>> mean_characters_per_word = MeanCharactersPerWord()
        >>> mean_characters_per_word(x).tolist()
        [3.0, 4.0, 5.0]
    """

    name = "mean_characters_per_word"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    default_value = 0

    def get_function(self):
        def mean_characters_per_word(series):
            return series.apply(_mean_characters_per_word)

        return mean_characters_per_word
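

# Illustrative sketch only, not part of the library: a standalone version of
# the per-string computation above, showing that only end-of-sentence
# punctuation is stripped while punctuation inside a word (e.g. "$1,000") is
# kept. The helper name `mean_chars` is hypothetical.

```python
import re

PUNCT = re.escape("!,.:;?")
# Punctuation run at end of string, or followed by a space or a newline.
END_RE = re.compile(rf"[{PUNCT}]+$|[{PUNCT}]+ |[{PUNCT}]+\n")


def mean_chars(text):
    # Replace end-of-sentence punctuation with a space, then split on whitespace.
    words = END_RE.sub(" ", text).split()
    # Mirror the primitive: a string with no words yields 0.
    return sum(len(w) for w in words) / len(words) if words else 0


print(mean_chars("Hello, world!"))      # trailing "," and "!" are stripped
print(mean_chars("third line $1,000"))  # the "," inside "$1,000" is kept
print(mean_chars(""))                   # no words
```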


================================================
FILE: featuretools/primitives/standard/transform/natural_language/median_word_length.py
================================================
from numpy import median
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.natural_language.constants import (
    DELIMITERS,
)


class MedianWordLength(TransformPrimitive):
    """Determines the median word length.

    Description:
        Given a list of strings, determine the median
        word length in each string. A word is defined as
        a series of any characters not separated by a delimiter.
        If a string is empty or `NaN`, return `NaN`.

    Args:
        delimiters_regex (str): Delimiters as a regex string for splitting text into words.
            Defaults to whitespace characters.

    Examples:
        >>> x = ['This is a test file', 'This is second line', 'third line $1,000', None]
        >>> median_word_length = MedianWordLength()
        >>> median_word_length(x).tolist()
        [4.0, 4.0, 5.0, nan]
    """

    name = "median_word_length"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    default_value = 0

    def __init__(self, delimiters_regex=DELIMITERS):
        self.delimiters_regex = delimiters_regex

    def get_function(self):
        def get_median(words):
            if isinstance(words, list):
                return median([len(word) for word in words if len(word) != 0])

        def median_word_length(x):
            words = x.str.split(self.delimiters_regex)
            return words.apply(get_median)

        return median_word_length


================================================
FILE: featuretools/primitives/standard/transform/natural_language/num_characters.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive


class NumCharacters(TransformPrimitive):
    """Calculates the number of characters in a given string, including whitespace and punctuation.

    Description:
        Returns the number of characters in a string. This is equivalent to the length of a string.

    Examples:
        >>> num_characters = NumCharacters()
        >>> num_characters(['This is a string',
        ...                 'second item',
        ...                 'final1']).tolist()
        [16, 11, 6]
    """

    name = "num_characters"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})

    description_template = "the number of characters in {}"

    def get_function(self):
        def character_counter(array):
            def _get_num_characters(elem):
                """Returns the length of elem, or pd.NA given null input"""
                if pd.isna(elem):
                    return pd.NA
                return len(elem)

            return array.apply(_get_num_characters)

        return character_counter


================================================
FILE: featuretools/primitives/standard/transform/natural_language/num_unique_separators.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive

NATURAL_LANGUAGE_SEPARATORS = [" ", ".", ",", "!", "?", ";", "\n"]


class NumUniqueSeparators(TransformPrimitive):
    r"""Calculates the number of unique separators.

    Description:
        Given a string and a list of separators, determine
        the number of unique separators in each string. If a string
        is null (as determined by `pd.isnull`), return `pd.NA`.

    Args:
        separators (list, optional): a list of separator characters to count.
            ``[" ", ".", ",", "!", "?", ";", "\n"]`` is used by default.

    Examples:
        >>> x = ["First. Line.", "This. is the second, line!", "notinlist@#$%^%&"]
        >>> num_unique_separators = NumUniqueSeparators([".", ",", "!"])
        >>> num_unique_separators(x).tolist()
        [1, 3, 0]
    """

    name = "num_unique_separators"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})

    def __init__(self, separators=NATURAL_LANGUAGE_SEPARATORS):
        assert separators is not None, "separators needs to be defined"
        self.separators = separators

    def get_function(self):
        def count_unique_separator(s):
            if pd.isnull(s):
                return pd.NA
            return len(set(self.separators).intersection(set(s)))

        def get_separator_count(column):
            return column.apply(count_unique_separator)

        return get_separator_count
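

# Illustrative sketch only, not part of the library: the count above is a set
# intersection between the separators and the characters of the string. One
# consequence is that a multi-character separator can never match, because
# set(text) contains single characters only. The helper name
# `count_unique_separators` is hypothetical.

```python
def count_unique_separators(text, separators=(".", ",", "!")):
    # Distinct characters of `text` that also appear in `separators`.
    return len(set(separators) & set(text))


for s in ["First. Line.", "This. is the second, line!", "notinlist@#$%^%&"]:
    print(s, "->", count_unique_separators(s))
```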


================================================
FILE: featuretools/primitives/standard/transform/natural_language/num_words.py
================================================
import re
from string import punctuation
from typing import Optional

import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.natural_language.constants import (
    DELIMITERS,
)


class NumWords(TransformPrimitive):
    """Determines the number of words in a string. Words are sequences of characters
    delimited by whitespace.

    Examples:
        >>> num_words = NumWords()
        >>> num_words(['This is a string',
        ...            'Two words',
        ...            'no-spaces',
        ...            'Also works with sentences. Second sentence!']).tolist()
        [4, 2, 1, 6]
    """

    name = "num_words"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})

    description_template = "the number of words in {}"

    def get_function(self):
        def word_counter(array):
            def _get_number_of_words(elem: Optional[str]):
                """Returns the number of words in given element,
                or pd.NA given null input"""
                if pd.isna(elem):
                    return pd.NA
                return sum(
                    1 for word in re.split(DELIMITERS, elem) if word.strip(punctuation)
                )

            return array.apply(_get_number_of_words)

        return word_counter


================================================
FILE: featuretools/primitives/standard/transform/natural_language/number_of_common_words.py
================================================
from string import punctuation
from typing import Iterable

import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.natural_language.constants import (
    DELIMITERS,
    common_words_1000,
)


class NumberOfCommonWords(TransformPrimitive):
    """Determines the number of common words in a string.

    Description:
        Given a string, determine the number of words that appear in a supplied word set.
        The word set defaults to `common_words_1000` from
        featuretools.primitives.standard.transform.natural_language.constants. Matching
        is case insensitive. The word set should consist of only lower case strings. If a string is
        missing, return `pd.NA`.

    Args:
        word_set (set, optional): The set of words to look for in the string. These
            words should all be lower case strings.
        delimiters_regex (str, optional): The regular expression used to determine
            what separates words. Defaults to whitespace characters.

    Examples:
        >>> x = ['Hey! This is some natural language', 'bacon, cheesburger, AND, fries', 'I! Am. A; duck?']
        >>> number_of_common_words = NumberOfCommonWords(word_set={'and', 'some', 'am', 'a', 'the', 'is', 'i'})
        >>> number_of_common_words(x).tolist()
        [2, 1, 3]

        >>> x = ['Hey! This is. some. natural language']
        >>> number_of_common_words = NumberOfCommonWords(word_set={'hey', 'is', 'some'}, delimiters_regex="[ .]")
        >>> number_of_common_words(x).tolist()
        [3]
    """

    name = "number_of_common_words"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})

    default_value = 0

    def __init__(
        self,
        word_set=common_words_1000,
        delimiters_regex=DELIMITERS,
    ):
        self.delimiters_regex = delimiters_regex
        self.word_set = word_set

    def get_function(self):
        def get_num_in_word_bank(words):
            if not isinstance(words, Iterable):
                return pd.NA
            num_common_words = 0
            for w in words:
                if (
                    w.lower().strip(punctuation) in self.word_set
                ):  # assumes word_set is all lowercase
                    num_common_words += 1
            return num_common_words

        def num_common_words(x):
            words = x.str.split(self.delimiters_regex)
            return words.apply(get_num_in_word_bank)

        return num_common_words


================================================
FILE: featuretools/primitives/standard/transform/natural_language/number_of_hashtags.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.standard.transform.natural_language.count_string import (
    CountString,
)


class NumberOfHashtags(CountString):
    """Determines the number of hashtags in a string.

    Description:
        Given a list of strings, determine the number of hashtags
        in each string.

        A hashtag is defined as a string that meets the following criteria:
            - Starts with a '#' character, followed by a sequence of alphanumeric characters containing at least one alphabetic character
            - Present at the start of a string or after whitespace
            - Terminated by the end of the string, a whitespace, or a punctuation character other than '#'
                - e.g. The string '#yes-no' contains a valid hashtag ('#yes')
                - e.g. The string '#yes#' does not contain a valid hashtag

        This implementation handles Unicode characters.

        This implementation does not impose any character limit on hashtags.

        If a string is missing, return `NaN`.

    Examples:
        >>> x = ['#regular #expression', 'this is a string', '###__regular#1and_0#expression']
        >>> number_of_hashtags = NumberOfHashtags()
        >>> number_of_hashtags(x).tolist()
        [2.0, 0.0, 0.0]
    """

    name = "number_of_hashtags"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    default_value = 0

    def __init__(self):
        pattern = r"((^#)|\s#)(\w*([^\W\d])+\w*)(?![#\w])"
        super().__init__(string=pattern, is_regex=True, ignore_case=False)
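

# Illustrative sketch only, not part of the library: applies the hashtag
# pattern above with re.findall, the same call CountString uses to count
# regex matches, to check the rules listed in the docstring. The helper name
# `count_hashtags` is hypothetical.

```python
import re

HASHTAG_RE = r"((^#)|\s#)(\w*([^\W\d])+\w*)(?![#\w])"


def count_hashtags(text):
    # One findall result per matched hashtag.
    return len(re.findall(HASHTAG_RE, text))


for s in ["#regular #expression", "#yes-no", "#yes#", "#123"]:
    print(repr(s), "->", count_hashtags(s))
```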


================================================
FILE: featuretools/primitives/standard/transform/natural_language/number_of_mentions.py
================================================
import re
import string

from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.standard.transform.natural_language.count_string import (
    CountString,
)


class NumberOfMentions(CountString):
    """Determines the number of mentions in a string.

    Description:
        Given a list of strings, determine the number of mentions
        in each string.

        A mention is defined as a string that meets the following criteria:
            - Starts with a '@' character, followed by a sequence of alphanumeric characters
            - Present at the start of a string or after whitespace
            - Terminated by the end of the string, a whitespace, or a punctuation character other than '@'
                - e.g. The string '@yes-no' contains a valid mention ('@yes')
                - e.g. The string '@yes@' does not contain a valid mention

        This implementation handles Unicode characters.

        This implementation does not impose any character limit on mentions.

        If a string is missing, return `NaN`.

    Examples:
        >>> x = ['@user1 @user2', 'this is a string', '@@@__user1@1and_0@expression']
        >>> number_of_mentions = NumberOfMentions()
        >>> number_of_mentions(x).tolist()
        [2.0, 0.0, 0.0]
    """

    name = "number_of_mentions"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    default_value = 0

    def __init__(self):
        SPECIALS_MINUS_AT = "".join(list(set(string.punctuation) - {"@"}))
        SPECIALS_MINUS_AT = re.escape(SPECIALS_MINUS_AT)
        pattern = rf"((^@)|(\s+@))(\w+)(?=\s|$|[{SPECIALS_MINUS_AT}])"
        super().__init__(string=pattern, is_regex=True, ignore_case=False)


================================================
FILE: featuretools/primitives/standard/transform/natural_language/number_of_unique_words.py
================================================
from string import punctuation
from typing import Iterable

import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.natural_language.constants import (
    DELIMITERS,
)


class NumberOfUniqueWords(TransformPrimitive):
    """Determines the number of unique words in a string.

    Description:
        Determines the number of unique words in a given string. Includes options for
        case-insensitive behavior.

    Args:
        case_insensitive (bool, optional): Specify case_insensitivity when searching for unique words.
            For example, setting this to True would mean "WORD word" would be treated as having
            one unique word. Defaults to False.

    Examples:
        >>> x = ['Word word Word', 'This is a SENTENCE.', 'green red green']
        >>> number_of_unique_words = NumberOfUniqueWords()
        >>> number_of_unique_words(x).tolist()
        [2, 4, 2]

        >>> x = ['word WoRD WORD worD', 'dog dog dog', 'catt CAT caT']
        >>> number_of_unique_words = NumberOfUniqueWords(case_insensitive=True)
        >>> number_of_unique_words(x).tolist()
        [1, 1, 2]
    """

    name = "number_of_unique_words"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})

    default_value = 0

    def __init__(self, case_insensitive=False):
        self.case_insensitive = case_insensitive

    def get_function(self):
        def _unique_word_helper(text):
            if not isinstance(text, Iterable):
                return pd.NA
            unique = set()
            for t in text:
                punct_less = t.strip(punctuation)
                if len(punct_less) > 0:
                    unique.add(punct_less)
            return len(unique)

        def num_unique_words(array):
            if self.case_insensitive:
                array = array.str.lower()
            array = array.str.split(DELIMITERS)
            return array.apply(_unique_word_helper)

        return num_unique_words


================================================
FILE: featuretools/primitives/standard/transform/natural_language/number_of_words_in_quotes.py
================================================
import re
from string import punctuation

import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.natural_language.constants import (
    DELIMITERS,
)


class NumberOfWordsInQuotes(TransformPrimitive):
    """Determines the number of words in quotes in a string.

    Description:
        Given a list of strings, determine the number of words in quotes
        in each string.

        This implementation handles Unicode characters.

        If a string is missing, return `pd.NA`.

    Args:
        quote_type (str, optional): Specifies what type of quotation marks to match.
            Argument "single" matches words between single quotes (' ').
            Argument "double" matches words between double quotes (" ").
            Argument "both" matches words between either type of quotes.
            Defaults to "both".

    Examples:
        >>> x = ['"python" java prolog "Diffie-Hellman" "4.99"', "Reach me at 'user@email.com'", "'Here's an interesting example!'"]
        >>> number_of_words_in_quotes = NumberOfWordsInQuotes()
        >>> number_of_words_in_quotes(x).tolist()
        [3, 1, 4]
    """

    name = "number_of_words_in_quotes"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    default_value = 0

    def __init__(self, quote_type="both"):
        if quote_type not in ["both", "single", "double"]:
            raise ValueError(
                f"{quote_type} is not a valid quote_type. Specify 'both', 'single', or 'double'",
            )
        self.quote_type = quote_type
        IN_DOUBLE_QUOTES = r'((^|\W)"(.)*?"(?!\w))'
        IN_SINGLE_QUOTES = r"((^|\W)'(.)*?'(?!\w))"
        if quote_type == "double":
            self.regex = IN_DOUBLE_QUOTES
        elif quote_type == "single":
            self.regex = IN_SINGLE_QUOTES
        else:
            self.regex = f"({IN_SINGLE_QUOTES}|{IN_DOUBLE_QUOTES})"

    def get_function(self):
        def count_words_in_quotes(text):
            if pd.isnull(text):
                return pd.NA
            matches = re.findall(self.regex, text, re.DOTALL)
            count = 0
            for match in matches:
                matched_phrase = match[0]
                words = re.split(DELIMITERS, matched_phrase)
                for word in words:
                    if len(word.strip(punctuation + " ")):
                        count += 1
            return count

        def num_words_in_quotes(array):
            return array.apply(count_words_in_quotes).astype("Int64")

        return num_words_in_quotes


================================================
FILE: featuretools/primitives/standard/transform/natural_language/punctuation_count.py
================================================
# -*- coding: utf-8 -*-

import re
import string

from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.standard.transform.natural_language.count_string import (
    CountString,
)


class PunctuationCount(CountString):
    """Determines number of punctuation characters in a string.

    Description:
        Given a list of strings, determine the number of punctuation
        characters in each string. Looks for any of the following:

        !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~

        If a string is missing, return `NaN`.

    Examples:
        >>> x = ['This is a test file.', 'This is second line', 'third line: $1,000']
        >>> punctuation_count = PunctuationCount()
        >>> punctuation_count(x).tolist()
        [1.0, 0.0, 3.0]
    """

    name = "punctuation_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    default_value = 0

    def __init__(self):
        pattern = "(%s)" % "|".join([re.escape(x) for x in string.punctuation])
        super().__init__(string=pattern, is_regex=True, ignore_case=False)


================================================
FILE: featuretools/primitives/standard/transform/natural_language/title_word_count.py
================================================
# -*- coding: utf-8 -*-
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.standard.transform.natural_language.count_string import (
    CountString,
)


class TitleWordCount(CountString):
    """Determines the number of title words in a string.

    Description:
        Given a list of strings, determine the number of title words
        in each string. A title word is defined as any word starting
        with a capital letter. Words at the start of a sentence will
        be counted.

        If a string is missing, return `NaN`.

    Examples:
        >>> x = ['My favorite movie is Jaws.', 'this is a string', 'AAA']
        >>> title_word_count = TitleWordCount()
        >>> title_word_count(x).tolist()
        [2.0, 0.0, 1.0]
    """

    name = "title_word_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    default_value = 0

    def __init__(self):
        pattern = r"([A-Z][^\s]*)"
        super().__init__(string=pattern, is_regex=True, ignore_case=False)


================================================
FILE: featuretools/primitives/standard/transform/natural_language/total_word_length.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.natural_language.constants import (
    PUNCTUATION_AND_WHITESPACE,
)


class TotalWordLength(TransformPrimitive):
    """Determines the total word length.

    Description:
        Given a list of strings, determine the total
        word length in each string, computed as the length of
        the string minus the number of matches of `do_not_count`.
        If a string is `NaN`, return `NaN`.

    Args:
        do_not_count (str): Regex string matching the characters to exclude
            from the count. Defaults to `PUNCTUATION_AND_WHITESPACE`.

    Examples:
        >>> x = ['This is a test file', 'This is second line', 'third line $1,000', None]
        >>> total_word_length = TotalWordLength()
        >>> total_word_length(x).tolist()
        [15.0, 16.0, 13.0, nan]
    """

    name = "total_word_length"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})

    default_value = 0

    def __init__(self, do_not_count=PUNCTUATION_AND_WHITESPACE):
        self.do_not_count = do_not_count

    def get_function(self):
        def total_word_length(x):
            return x.str.len() - x.str.count(self.do_not_count)

        return total_word_length


================================================
FILE: featuretools/primitives/standard/transform/natural_language/upper_case_count.py
================================================
# -*- coding: utf-8 -*-
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.standard.transform.natural_language.count_string import (
    CountString,
)


class UpperCaseCount(CountString):
    """Calculates the number of upper case letters in text.

    Description:
        Given a list of strings, determine the number of characters in each string
        that are capitalized. Counts every letter individually, not just every
        word that contains capitalized letters.

        If a string is missing, return `NaN`.

    Examples:
        >>> x = ['This IS a string.', 'This is a string', 'aaa']
        >>> upper_case_count = UpperCaseCount()
        >>> upper_case_count(x).tolist()
        [3.0, 1.0, 0.0]
    """

    name = "upper_case_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    default_value = 0

    def __init__(self):
        pattern = r"([A-Z])"
        super().__init__(string=pattern, is_regex=True, ignore_case=False)


================================================
FILE: featuretools/primitives/standard/transform/natural_language/upper_case_word_count.py
================================================
import re
from string import punctuation

import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import IntegerNullable, NaturalLanguage

from featuretools.primitives.base import TransformPrimitive
from featuretools.primitives.standard.transform.natural_language.constants import (
    DELIMITERS,
)


class UpperCaseWordCount(TransformPrimitive):
    """Determines the number of words in a string that are entirely capitalized.

    Description:
        Given a list of strings, determine the number of words in each string
        that are entirely capitalized.

        If a string is missing, return `NaN`.

    Examples:
        >>> x = ['This IS a string.', 'This is a string', 'AAA']
        >>> upper_case_word_count = UpperCaseWordCount()
        >>> upper_case_word_count(x).tolist()
        [1, 0, 1]
    """

    name = "upper_case_word_count"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    default_value = 0

    def get_function(self):
        def upper_case_word_count(x):
            def _count_upper_case_words(elem):
                if pd.isna(elem):
                    return pd.NA
                return sum(
                    1
                    for word in re.split(DELIMITERS, elem)
                    if word.strip(punctuation) and word.upper() == word
                )

            return x.apply(_count_upper_case_words)

        return upper_case_word_count


================================================
FILE: featuretools/primitives/standard/transform/natural_language/whitespace_count.py
================================================
from featuretools.primitives.standard.transform.natural_language.count_string import (
    CountString,
)


class WhitespaceCount(CountString):
    """Calculates number of whitespaces in a string.

    Description:
        Given a list of strings, determine the number of space
        characters (" ") in each string.
        If a string is missing, return `NaN`.

    Examples:
        >>> x = ['', 'hi im ethan', 'multiple    spaces']
        >>> whitespace_count = WhitespaceCount()
        >>> whitespace_count(x).tolist()
        [0.0, 2.0, 4.0]
    """

    name = "whitespace_count"
    default_value = 0

    def __init__(self):
        super().__init__(string=" ")


================================================
FILE: featuretools/primitives/standard/transform/not_primitive.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools.primitives.base import TransformPrimitive


class Not(TransformPrimitive):
    """Negates a boolean value.

    Examples:
        >>> not_func = Not()
        >>> not_func([True, True, False]).tolist()
        [False, False, True]
    """

    name = "not"
    input_types = [
        [ColumnSchema(logical_type=Boolean)],
        [ColumnSchema(logical_type=BooleanNullable)],
    ]
    return_type = ColumnSchema(logical_type=BooleanNullable)
    description_template = "the negation of {}"

    def generate_name(self, base_feature_names):
        return "NOT({})".format(base_feature_names[0])

    def get_function(self):
        return np.logical_not


================================================
FILE: featuretools/primitives/standard/transform/nth_week_of_month.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base import TransformPrimitive


class NthWeekOfMonth(TransformPrimitive):
    """Determines the nth week of the month from a given date.

    Description:
        Converts a datetime to a float representing the week
        of the month in which the date falls. The first day of
        the month starts week 1, and the week number is incremented
        each Sunday.

    Examples:
        >>> from datetime import datetime
        >>> nth_week_of_month = NthWeekOfMonth()
        >>> times = [datetime(2019, 3, 1),
        ...          datetime(2019, 3, 3),
        ...          datetime(2019, 3, 31),
        ...          datetime(2019, 3, 30)]
        >>> nth_week_of_month(times).tolist()
        [1.0, 2.0, 6.0, 5.0]
    """

    name = "nth_week_of_month"
    input_types = [ColumnSchema(logical_type=Datetime)]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    def get_function(self):
        def nth_week_of_month(x):
            df = pd.DataFrame({"date": x})
            df["first_day"] = df.date - pd.to_timedelta(df["date"].dt.day - 1, unit="d")
            df["dom"] = df.date.dt.day
            df["first_day_weekday"] = df.first_day.dt.weekday
            df["adjusted_dom"] = df.dom + df.first_day_weekday + 1
            df.loc[df["first_day_weekday"].astype(float) == 6.0, "adjusted_dom"] = df[
                "dom"
            ]
            df["week_of_month"] = np.ceil(df.adjusted_dom / 7.0)
            return df.week_of_month.values

        return nth_week_of_month
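

# Illustrative sketch only, not part of the library: a scalar version of the
# same arithmetic using only the standard library. The day of month is shifted
# by the number of days between the preceding Sunday and the 1st, then divided
# into 7-day blocks. The helper name `week_of_month` is hypothetical, and it
# returns an int where the primitive returns floats.

```python
import math
from datetime import datetime


def week_of_month(d):
    # weekday(): Monday=0 ... Sunday=6
    first_weekday = d.replace(day=1).weekday()
    # Days of the first week that fall before the 1st when weeks start on
    # Sunday (0 if the month itself starts on a Sunday).
    offset = (first_weekday + 1) % 7
    return math.ceil((d.day + offset) / 7)


for day in (1, 3, 30, 31):
    print(day, "->", week_of_month(datetime(2019, 3, day)))
```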


================================================
FILE: featuretools/primitives/standard/transform/numeric/__init__.py
================================================
from featuretools.primitives.standard.transform.numeric.absolute import Absolute
from featuretools.primitives.standard.transform.numeric.cosine import Cosine
from featuretools.primitives.standard.transform.numeric.diff import Diff
from featuretools.primitives.standard.transform.numeric.natural_logarithm import (
    NaturalLogarithm,
)
from featuretools.primitives.standard.transform.numeric.negate import Negate
from featuretools.primitives.standard.transform.numeric.percentile import Percentile
from featuretools.primitives.standard.transform.numeric.rate_of_change import (
    RateOfChange,
)
from featuretools.primitives.standard.transform.numeric.same_as_previous import (
    SameAsPrevious,
)
from featuretools.primitives.standard.transform.numeric.sine import Sine
from featuretools.primitives.standard.transform.numeric.square_root import SquareRoot
from featuretools.primitives.standard.transform.numeric.tangent import Tangent


================================================
FILE: featuretools/primitives/standard/transform/numeric/absolute.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class Absolute(TransformPrimitive):
    """Computes the absolute value of a number.

    Examples:
        >>> absolute = Absolute()
        >>> absolute([3.0, -5.0, -2.4]).tolist()
        [3.0, 5.0, 2.4]
    """

    name = "absolute"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    description_template = "the absolute value of {}"

    def get_function(self):
        return np.absolute


================================================
FILE: featuretools/primitives/standard/transform/numeric/cosine.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class Cosine(TransformPrimitive):
    """Computes the cosine of a number.

    Examples:
        >>> cos = Cosine()
        >>> cos([0.0, np.pi/2.0, np.pi]).tolist()
        [1.0, 6.123233995736766e-17, -1.0]
    """

    name = "cosine"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    description_template = "the cosine of {}"

    def get_function(self):
        return np.cos


================================================
FILE: featuretools/primitives/standard/transform/numeric/diff.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class Diff(TransformPrimitive):
    """Computes the difference between the value in a list and the
    previous value in that list.

    Args:
        periods (int): The number of periods by which to shift the index row.
            Default is 0. Periods correspond to rows.

    Description:
        Given a list of values, compute the difference from the previous
        item in the list. The result for the first element of the list will
        always be `NaN`.

    Examples:
        >>> diff = Diff()
        >>> values = [1, 10, 3, 4, 15]
        >>> diff(values).tolist()
        [nan, 9.0, -7.0, 1.0, 11.0]

        You can specify the number of periods to shift the values

        >>> values = [1, 2, 4, 7, 11, 16]
        >>> diff_periods = Diff(periods=1)
        >>> diff_periods(values).tolist()
        [nan, nan, 1.0, 2.0, 3.0, 4.0]
    """

    name = "diff"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_full_dataframe = True
    description_template = "the difference from the previous value of {}"

    def __init__(self, periods=0):
        self.periods = periods

    def get_function(self):
        def pd_diff(values):
            return values.shift(self.periods).diff()

        return pd_diff
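

# Illustrative sketch, not part of the library: the shift-then-diff order used
# by `pd_diff` above, reproduced with plain pandas so the docstring outputs can
# be checked by hand. The helper name `shifted_diff` is hypothetical.

```python
import pandas as pd


def shifted_diff(values, periods=0):
    """Shift the series down by `periods` rows, then take the first difference."""
    return pd.Series(values).shift(periods).diff()


# With no shift, only the first row lacks a predecessor.
no_shift = shifted_diff([1, 10, 3, 4, 15]).tolist()

# Shifting by one row pushes every difference one row later,
# so the first two rows are NaN.
one_shift = shifted_diff([1, 2, 4, 7, 11, 16], periods=1).tolist()
```

# Note that `periods` delays the output rather than widening the difference:
# each result is still the gap between two adjacent values.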


================================================
FILE: featuretools/primitives/standard/transform/numeric/natural_logarithm.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class NaturalLogarithm(TransformPrimitive):
    """Computes the natural logarithm of a number.

    Examples:
        >>> log = NaturalLogarithm()
        >>> results = log([1.0, np.e]).tolist()
        >>> results = [round(x, 2) for x in results]
        >>> results
        [0.0, 1.0]
    """

    name = "natural_logarithm"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    description_template = "the natural logarithm of {}"

    def get_function(self):
        return np.log


================================================
FILE: featuretools/primitives/standard/transform/numeric/negate.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class Negate(TransformPrimitive):
    """Negates a numeric value.

    Examples:
        >>> negate = Negate()
        >>> negate([1.0, 23.2, -7.0]).tolist()
        [-1.0, -23.2, 7.0]
    """

    name = "negate"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the negation of {}"

    def get_function(self):
        def negate(vals):
            return vals * -1

        return negate

    def generate_name(self, base_feature_names):
        return "-(%s)" % (base_feature_names[0])


================================================
FILE: featuretools/primitives/standard/transform/numeric/percentile.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class Percentile(TransformPrimitive):
    """Determines the percentile rank for each value in a list.

    Examples:
        >>> percentile = Percentile()
        >>> percentile([10, 15, 1, 20]).tolist()
        [0.5, 0.75, 0.25, 1.0]

        NaN values are ignored when determining rank.

        >>> percentile([10, 15, 1, None, 20]).tolist()
        [0.5, 0.75, 0.25, nan, 1.0]
    """

    name = "percentile"
    uses_full_dataframe = True
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    description_template = "the percentile rank of {}"

    def get_function(self):
        return lambda array: array.rank(pct=True)
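

# Illustrative sketch, not part of the library: what `rank(pct=True)` returns
# for the docstring input, plus a tie case. The variable names are made up for
# this example.

```python
import pandas as pd

# Each value's rank divided by the number of non-null values.
ranks = pd.Series([10, 15, 1, 20]).rank(pct=True).tolist()

# Tied values share the average of their ranks (pandas' default method).
tied = pd.Series([1, 1, 2]).rank(pct=True).tolist()
```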


================================================
FILE: featuretools/primitives/standard/transform/numeric/rate_of_change.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base import TransformPrimitive


class RateOfChange(TransformPrimitive):
    """Computes the rate of change of a value per second.

    Examples:
        >>> import pandas as pd
        >>> rate_of_change = RateOfChange()
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> results = rate_of_change([0, 30, 180, -90, 0], times).tolist()
        >>> results = [round(x, 2) for x in results]
        >>> results
        [nan, 0.5, 2.5, -4.5, 1.5]
    """

    name = "rate_of_change"
    input_types = [
        ColumnSchema(semantic_tags={"numeric"}),
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True
    description_template = "the rate of change of {} per second"

    def get_function(self):
        def rate_of_change(values, time):
            time_delta = time.diff().dt.total_seconds()
            value_delta = values.diff()
            return value_delta / time_delta

        return rate_of_change
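

# Illustrative sketch, not part of the library: the two deltas that
# `rate_of_change` divides, computed step by step on the docstring data.
# The variable names are chosen for this example only.

```python
import pandas as pd

times = pd.Series(pd.date_range(start="2019-01-01", freq="1min", periods=5))
values = pd.Series([0, 30, 180, -90, 0])

# Seconds elapsed since the previous row (NaN for the first row).
elapsed = times.diff().dt.total_seconds()
# Change in value since the previous row (NaN for the first row).
change = values.diff()

rate = (change / elapsed).tolist()
```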


================================================
FILE: featuretools/primitives/standard/transform/numeric/same_as_previous.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import BooleanNullable

from featuretools.primitives.base import TransformPrimitive


class SameAsPrevious(TransformPrimitive):
    """Determines if a value is equal to the previous value in a list.

    Description:
        Compares a value in a list to the previous value and returns True if
        the value is equal to the previous value or False otherwise. The
        first item in the output will always be False, since there is no previous
        element for the first element comparison.

        Any nan values in the input will be filled using either a forward-fill
        or backward-fill method, specified by the fill_method argument. The number
        of consecutive nan values that get filled can be limited with the limit
        argument. Any nan values left after filling will result in False being
        returned for any comparison involving the nan value.

    Args:
        fill_method (str): Method for filling gaps in series. Valid
            options are `backfill`, `bfill`, `pad`, `ffill`.
            `pad / ffill`: fill gap with last valid observation.
            `backfill / bfill`: fill gap with next valid observation.
            Default is `pad`.

        limit (int): The max number of consecutive NaN values in a gap that
            can be filled. Default is None.

    Examples:
        >>> same_as_previous = SameAsPrevious()
        >>> same_as_previous([1, 2, 2, 4]).tolist()
        [False, False, True, False]

        The fill method for nan values can be specified

        >>> same_as_previous_fillna = SameAsPrevious(fill_method="bfill")
        >>> same_as_previous_fillna([1, None, 2, 4]).tolist()
        [False, False, True, False]

        The number of nan values that are filled can be limited

        >>> same_as_previous_limitfill = SameAsPrevious(limit=2)
        >>> same_as_previous_limitfill([1, None, None, None, 2, 3]).tolist()
        [False, True, True, False, False, False]
    """

    name = "same_as_previous"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(BooleanNullable)

    def __init__(self, fill_method="pad", limit=None):
        if fill_method not in ["backfill", "bfill", "pad", "ffill"]:
            raise ValueError("Invalid fill_method")
        self.fill_method = fill_method
        self.limit = limit

    def get_function(self):
        def same_as_previous(x):
            x = x.fillna(method=self.fill_method, limit=self.limit)
            x = x.eq(x.shift())
            # first value will always be false, since there is no previous value
            x.iloc[0] = False
            return x

        return same_as_previous
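

# Illustrative sketch, not part of the library: the same fill-then-compare
# logic written with `Series.ffill` / `Series.bfill`, since the `method`
# argument of `fillna` is deprecated in recent pandas releases. The
# stand-alone function below is hypothetical and only mirrors the inner
# function above.

```python
import pandas as pd


def same_as_previous(values, fill_method="pad", limit=None):
    x = pd.Series(values, dtype="float64")
    if fill_method in ("pad", "ffill"):
        x = x.ffill(limit=limit)
    elif fill_method in ("backfill", "bfill"):
        x = x.bfill(limit=limit)
    else:
        raise ValueError("Invalid fill_method")
    # NaN never equals anything, so unfilled gaps compare as False.
    result = x.eq(x.shift())
    # The first row has no predecessor, so it is always False.
    result.iloc[0] = False
    return result.tolist()


plain = same_as_previous([1, 2, 2, 4])
backfilled = same_as_previous([1, None, 2, 4], fill_method="bfill")
limited = same_as_previous([1, None, None, None, 2, 3], limit=2)
```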


================================================
FILE: featuretools/primitives/standard/transform/numeric/sine.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class Sine(TransformPrimitive):
    """Computes the sine of a number.

    Examples:
        >>> sin = Sine()
        >>> sin([-np.pi/2.0, 0.0, np.pi/2.0]).tolist()
        [-1.0, 0.0, 1.0]
    """

    name = "sine"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    description_template = "the sine of {}"

    def get_function(self):
        return np.sin


================================================
FILE: featuretools/primitives/standard/transform/numeric/square_root.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class SquareRoot(TransformPrimitive):
    """Computes the square root of a number.

    Examples:
        >>> sqrt = SquareRoot()
        >>> sqrt([9.0, 16.0, 4.0]).tolist()
        [3.0, 4.0, 2.0]
    """

    name = "square_root"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    description_template = "the square root of {}"

    def get_function(self):
        return np.sqrt


================================================
FILE: featuretools/primitives/standard/transform/numeric/tangent.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class Tangent(TransformPrimitive):
    """Computes the tangent of a number.

    Examples:
        >>> tan = Tangent()
        >>> tan([-np.pi, 0.0, np.pi/2.0]).tolist()
        [1.2246467991473532e-16, 0.0, 1.633123935319537e+16]
    """

    name = "tangent"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    description_template = "the tangent of {}"

    def get_function(self):
        return np.tan


================================================
FILE: featuretools/primitives/standard/transform/percent_change.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class PercentChange(TransformPrimitive):
    """Determines the percent difference between values in a list.

    Description:
        Given a list of numbers, return the percent difference
        between each subsequent number. Percentages are shown in
        decimal form (not multiplied by 100). Uses pandas' pct_change
        function.

    Args:
        periods (int): Periods to shift for calculating percent change.
            Default is 1.

        fill_method (str): Method for filling gaps in reindexed
            Series. Valid options are `backfill`, `bfill`, `pad`, `ffill`.
            `pad / ffill`: fill gap with last valid observation.
            `backfill / bfill`: fill gap with next valid observation.
            Default is `pad`.

        limit (int): The max number of consecutive NaN values in a gap that
            can be filled. Default is None.

        freq (DateOffset, timedelta, or offset alias string):
            If `freq` is specified, instead of calculating change between subsequent
            points, PercentChange will calculate change between points with a
            certain interval between their date indices. `freq` defines the
            desired interval. When freq is used, the resulting index will also be
            filled to include any missing dates from the specified interval.

            If the index is not date/datetime and freq is used, it will raise a
            NotImplementedError.

            If freq is None, no changes will be applied. Default is None.

    Examples:
        >>> percent_change = PercentChange()
        >>> percent_change([2, 5, 15, 3, 3, 9, 4.5]).to_list()
        [nan, 1.5, 2.0, -0.8, 0.0, 2.0, -0.5]

        We can control the number of periods to return the percent
        difference between points further from one another.

        >>> percent_change_2 = PercentChange(periods=2)
        >>> percent_change_2([2, 5, 15, 3, 3, 9, 4.5]).to_list()
        [nan, nan, 6.5, -0.4, -0.8, 2.0, 0.5]

        We can control the method used to handle gaps in data.

        >>> percent_change = PercentChange()
        >>> percent_change([2, 4, 8, None, 16, None, 32, None]).to_list()
        [nan, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
        >>> percent_change_backfill = PercentChange(fill_method='backfill')
        >>> percent_change_backfill([2, 4, 8, None, 16, None, 32, None]).to_list()
        [nan, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, nan]

        We can also control the maximum number of NaN values to fill in a gap.

        >>> percent_change = PercentChange()
        >>> percent_change([2, None, None, None, 4]).to_list()
        [nan, 0.0, 0.0, 0.0, 1.0]
        >>> percent_change_limited = PercentChange(limit=2)
        >>> percent_change_limited([2, None, None, None, 4]).to_list()
        [nan, 0.0, 0.0, nan, nan]

        Finally, we can specify a date frequency on which to calculate percent
        change.

        >>> import pandas as pd
        >>> dates = pd.DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-05'])
        >>> x_indexed = pd.Series([1, 2, 3, 4], index=dates)
        >>> percent_change = PercentChange()
        >>> percent_change(x_indexed).to_list()
        [nan, 1.0, 0.5, 0.33333333333333326]
        >>> date_offset = pd.tseries.offsets.DateOffset(days=1)
        >>> percent_change_freq = PercentChange(freq=date_offset)
        >>> percent_change_freq(x_indexed).to_list()
        [nan, 1.0, 0.5, nan]
    """

    name = "percent_change"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    def __init__(self, periods=1, fill_method="pad", limit=None, freq=None):
        if fill_method not in ["backfill", "bfill", "pad", "ffill"]:
            raise ValueError("Invalid fill_method")
        self.periods = periods
        self.fill_method = fill_method
        self.limit = limit
        self.freq = freq

    def get_function(self):
        def percent_change(data):
            return data.pct_change(
                self.periods,
                self.fill_method,
                self.limit,
                self.freq,
            )

        return percent_change
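

# Illustrative sketch, not part of the library: the default `pad` behavior
# spelled out as an explicit forward fill followed by a ratio, which may help
# when reading the gap-handling examples in the docstring. It does not cover
# the `backfill` or `freq` options, and the function below is hypothetical.

```python
import pandas as pd


def percent_change(values, periods=1, limit=None):
    # Fill gaps with the last valid observation, at most `limit` in a row.
    filled = pd.Series(values, dtype="float64").ffill(limit=limit)
    # Compare each value with the one `periods` rows earlier.
    return (filled / filled.shift(periods) - 1).tolist()


padded = percent_change([2, 4, 8, None, 16, None, 32, None])
limited = percent_change([2, None, None, None, 4], limit=2)
```

# With `limit=2` the third gap stays NaN, so both the gap itself and the row
# after it have no valid ratio.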


================================================
FILE: featuretools/primitives/standard/transform/postal/__init__.py
================================================
from featuretools.primitives.standard.transform.postal.one_digit_postal_code import (
    OneDigitPostalCode,
)
from featuretools.primitives.standard.transform.postal.two_digit_postal_code import (
    TwoDigitPostalCode,
)


================================================
FILE: featuretools/primitives/standard/transform/postal/one_digit_postal_code.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, PostalCode

from featuretools.primitives.base import TransformPrimitive


class OneDigitPostalCode(TransformPrimitive):
    """Returns the one digit prefix of a given postal code.

    Description:
        Given a list of postal codes, returns the one digit prefix for each postal code.

    Examples:
        >>> one_digit_postal_code = OneDigitPostalCode()
        >>> one_digit_postal_code(['92432', '34514']).tolist()
        ['9', '3']
    """

    name = "one_digit_postal_code"
    input_types = [ColumnSchema(logical_type=PostalCode)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})
    description_template = "The one digit postal code prefix of {}"

    def get_function(self):
        def one_digit_postal_code(postal_codes):
            def transform_postal_code(pc):
                return str(pc)[0] if pd.notna(pc) else pd.NA

            return postal_codes.apply(transform_postal_code)

        return one_digit_postal_code


================================================
FILE: featuretools/primitives/standard/transform/postal/two_digit_postal_code.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, PostalCode

from featuretools.primitives.base import TransformPrimitive


class TwoDigitPostalCode(TransformPrimitive):
    """Returns the two digit prefix of a given postal code.

    Description:
        Given a list of postal codes, returns the two digit prefix for each postal code.

    Examples:
        >>> two_digit_postal_code = TwoDigitPostalCode()
        >>> two_digit_postal_code(['92432', '34514']).tolist()
        ['92', '34']
    """

    name = "two_digit_postal_code"
    input_types = [ColumnSchema(logical_type=PostalCode)]

    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})
    description_template = "The two digit postal code prefix of {}"

    def get_function(self):
        def two_digit_postal_code(postal_codes):
            def transform_postal_code(pc):
                return str(pc)[:2] if pd.notna(pc) else pd.NA

            return postal_codes.apply(transform_postal_code)

        return two_digit_postal_code


================================================
FILE: featuretools/primitives/standard/transform/savgol_filter.py
================================================
from math import floor

import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

from featuretools.primitives.base import TransformPrimitive


class SavgolFilter(TransformPrimitive):
    """Applies a Savitzky-Golay filter to a list of values.

    Description:
        Given a list of values, return a smoothed list which increases
        the signal-to-noise ratio without greatly distorting the
        signal. Uses the `Savitzky–Golay filter` method.

        If the input list has fewer than 20 values, it will be returned
        as is.

        See the following page for more info:
        https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.signal.savgol_filter.html

    Args:
        window_length (int):  The length of the filter window (i.e. the number
            of coefficients). `window_length` must be a positive odd integer.

        polyorder (int): The order of the polynomial used to fit the samples.
            `polyorder` must be less than `window_length`.

        deriv (int): Optional. The order of the derivative to compute.  This
            must be a nonnegative integer.  The default is 0, which means to
            filter the data without differentiating.

        delta (float): Optional. The spacing of the samples to which the filter
            will be applied. This is only used if deriv > 0.  Default is 1.0.

        mode (str): Optional. Must be 'mirror', 'constant', 'nearest', 'wrap'
            or 'interp'.  This determines the type of extension to use for the
            padded signal to which the filter is applied.  When `mode` is
            'constant', the padding value is given by `cval`.  See the Notes
            for more details on 'mirror', 'constant', 'wrap', and 'nearest'.

            When the 'interp' mode is selected (the default), no extension
            is used.  Instead, a degree `polyorder` polynomial is fit to the
            last `window_length` values of the edges, and this polynomial is
            used to evaluate the last `window_length // 2` output values.

        cval (scalar): Optional. Value to fill past the edges of the input
            if `mode` is 'constant'. Default is 0.0.

    Examples:
        >>> savgol_filter = SavgolFilter()
        >>> data = [0, 1, 1, 2, 3, 4, 5, 7, 8, 7, 9, 9, 12, 11, 12, 14, 15, 17, 17, 17, 20]
        >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]]
        [0.0429, 0.8286, 1.2571]

        We can control `window_length` and `polyorder` of the filter.

        >>> savgol_filter = SavgolFilter(window_length=13, polyorder=3)
        >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]]
        [-0.0962, 0.6484, 1.4451]

        We can also control the `deriv` and `delta` parameters.

        >>> savgol_filter = SavgolFilter(deriv=1, delta=1.5)
        >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]]
        [0.754, 0.3492, 0.2778]

        Finally, we can use `mode` to control how edge values are handled.

        >>> savgol_filter = SavgolFilter(mode='constant', cval=5)
        >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]]
        [1.5429, 0.2286, 1.2571]
    """

    name = "savgol_filter"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})

    def __init__(
        self,
        window_length=None,
        polyorder=None,
        deriv=0,
        delta=1.0,
        mode="interp",
        cval=0.0,
    ):
        if window_length is not None and polyorder is not None:
            if mode not in ["mirror", "constant", "nearest", "interp", "wrap"]:
                raise ValueError(
                    "mode must be 'mirror', 'constant', "
                    "'nearest', 'wrap' or 'interp'.",
                )
            # Validate the filter parameters early; raises on invalid input.
            savgol_coeffs(window_length, polyorder, deriv=deriv, delta=delta)
        elif (window_length is None and polyorder is not None) or (
            window_length is not None and polyorder is None
        ):
            error_text = (
                "Both window_length and polyorder must be defined if you define one."
            )
            raise ValueError(error_text)

        self.window_length = window_length
        self.polyorder = polyorder
        self.deriv = deriv
        self.delta = delta
        self.mode = mode
        self.cval = cval

    def get_function(self):
        def smooth(x):
            if x.shape[0] < 20:
                return x
            if np.isnan(np.min(x)):
                # interpolate the nan values, works for edges & middle nans
                mask = np.isnan(x)
                x[mask] = np.interp(
                    np.flatnonzero(mask),
                    np.flatnonzero(~mask),
                    x[~mask],
                )
            window_length = self.window_length
            polyorder = self.polyorder
            if window_length is None and polyorder is None:
                window_length = floor(len(x) / 10) * 2 + 1
                polyorder = 3
            return savgol_filter(
                x,
                window_length=window_length,
                polyorder=polyorder,
                deriv=self.deriv,
                delta=self.delta,
                mode=self.mode,
                cval=self.cval,
            )

        return smooth
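

# Illustrative sketch, not part of the library: the default window length that
# `smooth` derives when neither `window_length` nor `polyorder` is given.
# The helper name `default_window_length` is hypothetical.

```python
from math import floor


def default_window_length(n_values):
    """Roughly one fifth of the input length, forced to be odd."""
    return floor(n_values / 10) * 2 + 1


lengths = {n: default_window_length(n) for n in (20, 21, 55, 100)}
```

# Multiplying by 2 and adding 1 guarantees an odd window, which the filter
# requires. For the smallest accepted input (20 values) the window is 5, which
# is still larger than the default `polyorder` of 3.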


================================================
FILE: featuretools/primitives/standard/transform/time_series/__init__.py
================================================
from featuretools.primitives.standard.transform.time_series.lag import Lag
from featuretools.primitives.standard.transform.time_series.numeric_lag import (
    NumericLag,
)
from featuretools.primitives.standard.transform.time_series.rolling_count import (
    RollingCount,
)
from featuretools.primitives.standard.transform.time_series.rolling_max import (
    RollingMax,
)
from featuretools.primitives.standard.transform.time_series.rolling_mean import (
    RollingMean,
)
from featuretools.primitives.standard.transform.time_series.rolling_min import (
    RollingMin,
)
from featuretools.primitives.standard.transform.time_series.rolling_outlier_count import (
    RollingOutlierCount,
)
from featuretools.primitives.standard.transform.time_series.rolling_std import (
    RollingSTD,
)
from featuretools.primitives.standard.transform.time_series.rolling_trend import (
    RollingTrend,
)
from featuretools.primitives.standard.transform.time_series.expanding import (
    ExpandingCount,
    ExpandingMax,
    ExpandingMean,
    ExpandingMin,
    ExpandingSTD,
    ExpandingTrend,
)


================================================
FILE: featuretools/primitives/standard/transform/time_series/expanding/__init__.py
================================================
from featuretools.primitives.standard.transform.time_series.expanding.expanding_count import (
    ExpandingCount,
)
from featuretools.primitives.standard.transform.time_series.expanding.expanding_max import (
    ExpandingMax,
)
from featuretools.primitives.standard.transform.time_series.expanding.expanding_mean import (
    ExpandingMean,
)
from featuretools.primitives.standard.transform.time_series.expanding.expanding_min import (
    ExpandingMin,
)
from featuretools.primitives.standard.transform.time_series.expanding.expanding_std import (
    ExpandingSTD,
)
from featuretools.primitives.standard.transform.time_series.expanding.expanding_trend import (
    ExpandingTrend,
)


================================================
FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_count.py
================================================
import numpy as np
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, IntegerNullable

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    _apply_gap_for_expanding_primitives,
)


class ExpandingCount(TransformPrimitive):
    """Computes the expanding count of events over a given window.

    Description:
        Given a list of datetimes, returns an expanding count starting
        at the row `gap` rows away from the current row. An expanding
        primitive calculates the value of a primitive for a given time
        with all the data available up to the corresponding point in time.

        Input datetimes should be monotonic.

    Args:
        gap (int, optional): Specifies a gap backwards from each instance before the
            usable data begins. Corresponds to number of rows. Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Defaults to 1.


    Examples:
        >>> import pandas as pd
        >>> expanding_count = ExpandingCount()
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_count(times).tolist()
        [nan, 1.0, 2.0, 3.0, 4.0]

        We can also control the gap before the expanding calculation.

        >>> import pandas as pd
        >>> expanding_count = ExpandingCount(gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_count(times).tolist()
        [1.0, 2.0, 3.0, 4.0, 5.0]

        We can also control the minimum number of periods required for the expanding calculation.

        >>> import pandas as pd
        >>> expanding_count = ExpandingCount(min_periods=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_count(times).tolist()
        [nan, nan, nan, 3.0, 4.0]
    """

    name = "expanding_count"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, gap=1, min_periods=1):
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def expanding_count(datetime_series):
            datetime_series = _apply_gap_for_expanding_primitives(
                datetime_series,
                self.gap,
            )
            count_series = datetime_series.expanding(
                min_periods=self.min_periods,
            ).count()
            num_nans = self.gap + self.min_periods - 1
            count_series[range(num_nans)] = np.nan
            return count_series

        return expanding_count


================================================
FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_max.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    _apply_gap_for_expanding_primitives,
)


class ExpandingMax(TransformPrimitive):
    """Computes the expanding maximum of events over a given window.

    Description:
        Given a list of datetimes and a list of numbers, returns an
        expanding maximum of the numbers starting at the row `gap` rows
        away from the current row. An expanding primitive calculates the
        value of a primitive for a given time with all the data available
        up to the corresponding point in time.

        Input datetimes should be monotonic.

    Args:
        gap (int, optional): Specifies a gap backwards from each instance before the
            usable data begins. Corresponds to number of rows. Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Defaults to 1.


    Examples:
        >>> import pandas as pd
        >>> expanding_max = ExpandingMax()
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_max(times, [2, 4, 6, 7, 2]).tolist()
        [nan, 2.0, 4.0, 6.0, 7.0]

        We can also control the gap before the expanding calculation.

        >>> import pandas as pd
        >>> expanding_max = ExpandingMax(gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_max(times, [2, 4, 6, 7, 2]).tolist()
        [2.0, 4.0, 6.0, 7.0, 7.0]

        We can also control the minimum number of periods required for the expanding calculation.

        >>> import pandas as pd
        >>> expanding_max = ExpandingMax(min_periods=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_max(times, [2, 4, 6, 7, 2]).tolist()
        [nan, nan, nan, 6.0, 7.0]
    """

    name = "expanding_max"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, gap=1, min_periods=1):
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def expanding_max(datetime, numeric):
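            x = pd.Series(numeric.values, index=datetime)
            # Apply the row gap before the expanding window so that each row
            # only sees values that are at least `gap` rows older than itself.
            x = _apply_gap_for_expanding_primitives(x, self.gap)
            return x.expanding(min_periods=self.min_periods).max().values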

        return expanding_max
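

Reading aid (not part of expanding_max.py): the way `gap` and `min_periods` interact in the docstring examples above can be reproduced without pandas. This is a minimal sketch, assuming a non-negative integer `gap` and assuming the gap helper behaves like a plain row shift, which is what the docstring examples suggest. `expanding_max_sketch` is a hypothetical name used only for illustration.

```python
import math


def expanding_max_sketch(values, gap=1, min_periods=1):
    """Expanding max over all rows up to `gap` rows before each row (illustration only)."""
    out = []
    for i in range(len(values)):
        # Rows usable at position i: everything up to and including row i - gap.
        usable = values[: max(i - gap + 1, 0)]
        # An empty window never produces a value, even if min_periods is 0.
        if len(usable) < max(min_periods, 1):
            out.append(math.nan)
        else:
            out.append(float(max(usable)))
    return out


print(expanding_max_sketch([2, 4, 6, 7, 2]))                 # [nan, 2.0, 4.0, 6.0, 7.0]
print(expanding_max_sketch([2, 4, 6, 7, 2], gap=0))          # [2.0, 4.0, 6.0, 7.0, 7.0]
print(expanding_max_sketch([2, 4, 6, 7, 2], min_periods=3))  # [nan, nan, nan, 6.0, 7.0]
```

The three calls mirror the three docstring examples. With the default `gap=1`, the first row has no usable history, so it is always missing.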


================================================
FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_mean.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    _apply_gap_for_expanding_primitives,
)


class ExpandingMean(TransformPrimitive):
    """Computes the expanding mean of events over a given window.

    Description:
        Given a list of numbers and a corresponding list of datetimes,
        returns the expanding mean of the numeric values, starting at
        the row `gap` rows away from the current row and looking backward
        over all earlier rows. An expanding primitive calculates its value
        for a given time using all the data available up to that point in time.

        Input datetimes should be monotonic.

    Args:
        gap (int, optional): Specifies a gap backwards from each instance before the
            usable data begins. Corresponds to number of rows. Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Defaults to 1.


    Examples:
        >>> import pandas as pd
        >>> expanding_mean = ExpandingMean()
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_mean(times, [5, 4, 3, 2, 1]).tolist()
        [nan, 5.0, 4.5, 4.0, 3.5]

        We can also control the gap before the expanding calculation.

        >>> import pandas as pd
        >>> expanding_mean = ExpandingMean(gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_mean(times, [5, 4, 3, 2, 1]).tolist()
        [5.0, 4.5, 4.0, 3.5, 3.0]

        We can also control the minimum number of periods required for the expanding calculation.

        >>> import pandas as pd
        >>> expanding_mean = ExpandingMean(min_periods=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_mean(times, [5, 4, 3, 2, 1]).tolist()
        [nan, nan, nan, 4.0, 3.5]
    """

    name = "expanding_mean"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, gap=1, min_periods=1):
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def expanding_mean(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime)
            x = _apply_gap_for_expanding_primitives(x, self.gap)
            return x.expanding(min_periods=self.min_periods).mean().values

        return expanding_mean


================================================
FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_min.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    _apply_gap_for_expanding_primitives,
)


class ExpandingMin(TransformPrimitive):
    """Computes the expanding minimum of events over a given window.

    Description:
        Given a list of numbers and a corresponding list of datetimes,
        returns the expanding minimum of the numeric values, starting at
        the row `gap` rows away from the current row and looking backward
        over all earlier rows. An expanding primitive calculates its value
        for a given time using all the data available up to that point in time.

        Input datetimes should be monotonic.

    Args:
        gap (int, optional): Specifies a gap backwards from each instance before the
            usable data begins. Corresponds to number of rows. Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Defaults to 1.

    Examples:
        >>> import pandas as pd
        >>> expanding_min = ExpandingMin()
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_min(times, [5, 4, 3, 2, 1]).tolist()
        [nan, 5.0, 4.0, 3.0, 2.0]

        We can also control the gap before the expanding calculation.

        >>> import pandas as pd
        >>> expanding_min = ExpandingMin(gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_min(times, [5, 4, 3, 2, 1]).tolist()
        [5.0, 4.0, 3.0, 2.0, 1.0]

        We can also control the minimum number of periods required for the expanding calculation.

        >>> import pandas as pd
        >>> expanding_min = ExpandingMin(min_periods=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> expanding_min(times, [5, 4, 3, 2, 1]).tolist()
        [nan, nan, nan, 3.0, 2.0]
    """

    name = "expanding_min"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, gap=1, min_periods=1):
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def expanding_min(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime)
            x = _apply_gap_for_expanding_primitives(x, self.gap)
            return x.expanding(min_periods=self.min_periods).min().values

        return expanding_min


================================================
FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_std.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    _apply_gap_for_expanding_primitives,
)


class ExpandingSTD(TransformPrimitive):
    """Computes the expanding standard deviation for events over a given window.

    Description:
        Given a list of numbers and a corresponding list of datetimes,
        returns the expanding standard deviation of the numeric values,
        starting at the row `gap` rows away from the current row and looking
        backward over all earlier rows. An expanding primitive calculates its value
        for a given time using all the data available up to that point in time.

        Input datetimes should be monotonic.

    Args:
        gap (int, optional): Specifies a gap backwards from each instance before the
            usable data begins. Corresponds to number of rows. Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Defaults to 1.


    Examples:
        >>> import pandas as pd
        >>> expanding_std = ExpandingSTD()
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> ans = expanding_std(times, [5, 4, 3, 2, 1]).tolist()
        >>> [round(x, 2) for x in ans]
        [nan, nan, 0.71, 1.0, 1.29]

        We can also control the gap before the expanding calculation.

        >>> import pandas as pd
        >>> expanding_std = ExpandingSTD(gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> ans = expanding_std(times, [5, 4, 3, 2, 1]).tolist()
        >>> [round(x, 2) for x in ans]
        [nan, 0.71, 1.0, 1.29, 1.58]

        We can also control the minimum number of periods required for the expanding calculation.

        >>> import pandas as pd
        >>> expanding_std = ExpandingSTD(min_periods=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> ans = expanding_std(times, [5, 4, 3, 2, 1]).tolist()
        >>> [round(x, 2) for x in ans]
        [nan, nan, nan, 1.0, 1.29]
    """

    name = "expanding_std"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, gap=1, min_periods=1):
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def expanding_std(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime)
            x = _apply_gap_for_expanding_primitives(x, self.gap)
            return x.expanding(min_periods=self.min_periods).std().values

        return expanding_std
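

Reading aid (not part of expanding_std.py): the docstring examples above show one more leading `nan` than the max, mean and min primitives do. A likely reason is that pandas computes the sample standard deviation (`ddof=1`) by default, which is undefined for a single observation. The sketch below reproduces that behavior with the standard library only. It assumes a non-negative integer `gap` that acts as a plain row shift, and `expanding_std_sketch` is a hypothetical name used only for illustration.

```python
import math
import statistics


def expanding_std_sketch(values, gap=1, min_periods=1):
    """Expanding sample standard deviation with a row gap (illustration only)."""
    out = []
    for i in range(len(values)):
        # Rows usable at position i: everything up to and including row i - gap.
        usable = values[: max(i - gap + 1, 0)]
        # The sample standard deviation needs at least two observations.
        if len(usable) < max(min_periods, 2):
            out.append(math.nan)
        else:
            out.append(statistics.stdev(usable))
    return out


rounded = [round(v, 2) for v in expanding_std_sketch([5, 4, 3, 2, 1])]
print(rounded)  # [nan, nan, 0.71, 1.0, 1.29]
```

With `gap=1`, row 0 has no history and row 1 has a single prior value, so both are missing. The first defined value appears at row 2.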


================================================
FILE: featuretools/primitives/standard/transform/time_series/expanding/expanding_trend.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    _apply_gap_for_expanding_primitives,
)
from featuretools.utils import calculate_trend


class ExpandingTrend(TransformPrimitive):
    """Computes the expanding trend for events over a given window.

    Description:
        Given a list of numbers and a corresponding list of datetimes,
        returns the expanding trend of the numeric values, starting at
        the row `gap` rows away from the current row and looking backward
        over all earlier rows. An expanding primitive calculates its value
        for a given time using all the data available up to that point in time.

        Input datetimes should be monotonic.

    Args:
        gap (int, optional): Specifies a gap backwards from each instance before the
            usable data begins. Corresponds to number of rows. Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Defaults to 1.


    Examples:
        >>> import pandas as pd
        >>> expanding_trend = ExpandingTrend()
        >>> times = pd.date_range(start='2019-01-01', freq='1D', periods=5)
        >>> ans = expanding_trend(times, [5, 4, 3, 2, 1]).tolist()
        >>> [round(x, 2) for x in ans]
        [nan, nan, nan, -1.0, -1.0]

        We can also control the gap before the expanding calculation.

        >>> import pandas as pd
        >>> expanding_trend = ExpandingTrend(gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1D', periods=5)
        >>> ans = expanding_trend(times, [5, 4, 3, 2, 1]).tolist()
        >>> [round(x, 2) for x in ans]
        [nan, nan, -1.0, -1.0, -1.0]

        We can also control the minimum number of periods required for the expanding calculation.

        >>> import pandas as pd
        >>> expanding_trend = ExpandingTrend(min_periods=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> ans = expanding_trend(times, [50, 4, 13, 22, 10]).tolist()
        >>> [round(x, 2) for x in ans]
        [nan, nan, nan, -18.5, -7.5]
    """

    name = "expanding_trend"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, gap=1, min_periods=1):
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def expanding_trend(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime)
            x = _apply_gap_for_expanding_primitives(x, self.gap)
            return (
                x.expanding(min_periods=self.min_periods)
                .aggregate(calculate_trend)
                .values
            )

        return expanding_trend


================================================
FILE: featuretools/primitives/standard/transform/time_series/lag.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools.primitives.base import TransformPrimitive


class Lag(TransformPrimitive):
    """Shifts an array of values by a specified number of periods.

    Args:
        periods (int): The number of periods by which to shift the input.
            Default is 1. Periods correspond to rows.

    Examples:
        >>> lag = Lag()
        >>> lag([1, 2, 3, 4, 5], pd.Series(pd.date_range(start="2020-01-01", periods=5, freq='D'))).tolist()
        [nan, 1.0, 2.0, 3.0, 4.0]

        You can specify the number of periods to shift the values

        >>> lag_periods = Lag(periods=3)
        >>> lag_periods([True, False, False, True, True], pd.Series(pd.date_range(start="2020-01-01", periods=5, freq='D'))).tolist()
        [nan, nan, nan, True, False]
    """

    # Note: with pandas 1.5.0, using Lag with a string input will result in `None` values
    # being introduced instead of `nan` values that were present in previous versions.
    # All missing values will be replaced by `np.nan` (for Double) or `pd.NA` (all other types)
    # once Woodwork is initialized on the feature matrix.
    name = "lag"
    input_types = [
        [
            ColumnSchema(semantic_tags={"category"}),
            ColumnSchema(semantic_tags={"time_index"}),
        ],
        [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(semantic_tags={"time_index"}),
        ],
        [
            ColumnSchema(logical_type=Boolean),
            ColumnSchema(semantic_tags={"time_index"}),
        ],
        [
            ColumnSchema(logical_type=BooleanNullable),
            ColumnSchema(semantic_tags={"time_index"}),
        ],
    ]
    return_type = None
    uses_full_dataframe = True

    def __init__(self, periods=1):
        self.periods = periods

    def get_function(self):
        def lag(input_col, time_index):
            x = pd.Series(input_col.values, index=time_index.values)
            # Shift the values by `periods` rows; vacated positions are left missing.
            return x.shift(periods=self.periods, fill_value=None).values

        return lag
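

Reading aid (not part of lag.py): the row shift that the docstring describes can be sketched with plain lists. This is an illustration only. It uses `None` as the missing-value marker, whereas the primitive itself returns pandas missing values (and numeric inputs come back as floats, as the docstring examples show). `lag_sketch` is a hypothetical name, not a featuretools function.

```python
def lag_sketch(values, periods=1):
    """Shift `values` by `periods` rows, padding vacated rows with None (illustration only)."""
    n = len(values)
    if periods >= 0:
        # Positive periods push values toward later rows.
        return [None] * min(periods, n) + list(values[: max(n - periods, 0)])
    # Negative periods pull values toward earlier rows.
    return list(values[-periods:]) + [None] * min(-periods, n)


print(lag_sketch([1, 2, 3, 4, 5]))                               # [None, 1, 2, 3, 4]
print(lag_sketch([True, False, False, True, True], periods=3))  # [None, None, None, True, False]
```

The time index does not appear in the sketch because the shift is by rows, as the `periods` argument description states.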


================================================
FILE: featuretools/primitives/standard/transform/time_series/numeric_lag.py
================================================
import warnings

import pandas as pd
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import TransformPrimitive


class NumericLag(TransformPrimitive):
    """Shifts an array of values by a specified number of periods.

    Args:
        periods (int): The number of periods by which to shift the input.
            Default is 1. Periods correspond to rows.

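        fill_value (int, float, optional): The value to use to fill in
            the gaps left after shifting the input. Default is None.

    Note:
        NumericLag is deprecated and will be removed in a future version.
        Please use the `Lag` primitive instead.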

    Examples:
        >>> lag = NumericLag()
        >>> lag(pd.Series(pd.date_range(start="2020-01-01", periods=5, freq='D')), [1, 2, 3, 4, 5]).tolist()
        [nan, 1.0, 2.0, 3.0, 4.0]

        You can specify the number of periods to shift the values

        >>> lag_periods = NumericLag(periods=3)
        >>> lag_periods(pd.Series(pd.date_range(start="2020-01-01", periods=5, freq='D')), [1, 2, 3, 4, 5]).tolist()
        [nan, nan, nan, 1.0, 2.0]

        You can specify the fill value to use

        >>> lag_fill_value = NumericLag(fill_value=100)
        >>> lag_fill_value(pd.Series(pd.date_range(start="2020-01-01", periods=4, freq='D')), [1, 2, 3, 4]).tolist()
        [100, 1, 2, 3]
    """

    name = "numeric_lag"
    input_types = [
        ColumnSchema(semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, periods=1, fill_value=None):
        self.periods = periods
        self.fill_value = fill_value
        warnings.warn(
            "NumericLag is deprecated and will be removed in a future version. Please use the 'Lag' primitive instead.",
            FutureWarning,
        )

    def get_function(self):
        def lag(time_index, numeric):
            x = pd.Series(numeric.values, index=time_index.values)
            return x.shift(periods=self.periods, fill_value=self.fill_value).values

        return lag


================================================
FILE: featuretools/primitives/standard/transform/time_series/rolling_count.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    apply_rolling_agg_to_series,
)


class RollingCount(TransformPrimitive):
    """Determines a rolling count of events over a given window.

    Description:
        Given a list of datetimes, return a rolling count starting
        at the row `gap` rows away from the current row and looking backward over the specified
        time window (by `window_length` and `gap`).

        Input datetimes should be monotonic.

    Args:
        window_length (int, string, optional): Specifies the amount of data included in each window.
            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,
            for example of one day, the window_length will correspond to a period of time, in this case,
            7 days for a window_length of 7.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time that each window should span.
            The list of available offset aliases can be found at
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
            Defaults to 3.
        gap (int, string, optional): Specifies a gap backwards from each instance before the
            window of usable data begins. If an integer is provided, it will correspond to a number of rows.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time between a target instance and the beginning of its window.
            Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Can only be as large as window_length when window_length is an integer.
            When window_length is an offset alias string, this limitation does not exist, but care should be taken
            to not choose a min_periods that will always be larger than the number of observations in a window.
            Defaults to 0.

    Note:
        Only offset aliases with fixed frequencies can be used when defining gap and window_length.
        This means that aliases such as `M` or `W` cannot be used, as they can indicate different
        numbers of days. ('M', because different months have different numbers of days;
        'W' because a week is anchored to a certain day of the week, like `W-WED`, so it will
        indicate a different number of days depending on the anchoring date.)

    Note:
        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.
        The reverse is not required: an offset alias `window_length` can be combined with a numeric `gap`. In fact,
        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more
        efficient.

    Examples:
        >>> import pandas as pd
        >>> rolling_count = RollingCount(window_length=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_count(times).tolist()
        [nan, 1.0, 2.0, 3.0, 3.0]

        We can also control the gap before the rolling calculation.

        >>> import pandas as pd
        >>> rolling_count = RollingCount(window_length=3, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_count(times).tolist()
        [1.0, 2.0, 3.0, 3.0, 3.0]

        We can also control the minimum number of periods required for the rolling calculation.

        >>> import pandas as pd
        >>> rolling_count = RollingCount(window_length=3, min_periods=3, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_count(times).tolist()
        [nan, nan, 3.0, 3.0, 3.0]

        We can also set the window_length and gap using offset alias strings.

        >>> import pandas as pd
        >>> rolling_count = RollingCount(window_length='3min', gap='1min')
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_count(times).tolist()
        [nan, 1.0, 2.0, 3.0, 3.0]

    """

    name = "rolling_count"
    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, window_length=3, gap=1, min_periods=0):
        self.window_length = window_length
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def rolling_count(datetime):
            # Use a constant placeholder value; only the number of rows in each window matters.
            x = pd.Series(1, index=datetime)
            return apply_rolling_agg_to_series(
                x,
                lambda series: series.count(),
                self.window_length,
                self.gap,
                self.min_periods,
                ignore_window_nans=True,
            )

        return rolling_count


================================================
FILE: featuretools/primitives/standard/transform/time_series/rolling_max.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    apply_rolling_agg_to_series,
)


class RollingMax(TransformPrimitive):
    """Determines the maximum of entries over a given window.

    Description:
        Given a list of numbers and a corresponding list of
        datetimes, return a rolling maximum of the numeric values,
        starting at the row `gap` rows away from the current row and looking backward
        over the specified window (by `window_length` and `gap`).

        Input datetimes should be monotonic.

    Args:
        window_length (int, string, optional): Specifies the amount of data included in each window.
            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,
            for example of one day, the window_length will correspond to a period of time, in this case,
            7 days for a window_length of 7.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time that each window should span.
            The list of available offset aliases can be found at
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
            Defaults to 3.
        gap (int, string, optional): Specifies a gap backwards from each instance before the
            window of usable data begins. If an integer is provided, it will correspond to a number of rows.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time between a target instance and the beginning of its window.
            Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Can only be as large as window_length when window_length is an integer.
            When window_length is an offset alias string, this limitation does not exist, but care should be taken
            to not choose a min_periods that will always be larger than the number of observations in a window.
            Defaults to 1.

    Note:
        Only offset aliases with fixed frequencies can be used when defining gap and window_length.
        This means that aliases such as `M` or `W` cannot be used, as they can indicate different
        numbers of days. ('M', because different months have different numbers of days;
        'W' because a week is anchored to a certain day of the week, like `W-WED`, so it will
        indicate a different number of days depending on the anchoring date.)

    Note:
        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.
        The reverse is not required: an offset alias `window_length` can be combined with a numeric `gap`. In fact,
        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more
        efficient.

    Examples:
        >>> import pandas as pd
        >>> rolling_max = RollingMax(window_length=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist()
        [nan, 4.0, 4.0, 4.0, 3.0]

        We can also control the gap before the rolling calculation.

        >>> import pandas as pd
        >>> rolling_max = RollingMax(window_length=3, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist()
        [4.0, 4.0, 4.0, 3.0, 2.0]

        We can also control the minimum number of periods required for the rolling calculation.

        >>> import pandas as pd
        >>> rolling_max = RollingMax(window_length=3, min_periods=3, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist()
        [nan, nan, 4.0, 3.0, 2.0]

        We can also set the window_length and gap using offset alias strings.

        >>> import pandas as pd
        >>> rolling_max = RollingMax(window_length='3min', gap='1min')
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist()
        [nan, 4.0, 4.0, 4.0, 3.0]
    """

    name = "rolling_max"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, window_length=3, gap=1, min_periods=1):
        self.window_length = window_length
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def rolling_max(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime.values)
            return apply_rolling_agg_to_series(
                x,
                lambda series: series.max(),
                self.window_length,
                self.gap,
                self.min_periods,
            )

        return rolling_max
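

Reading aid (not part of rolling_max.py): for integer `window_length` and `gap`, the window selection described in the docstring can be sketched with plain list slicing. This is a minimal sketch under stated assumptions: it covers only the integer case (offset alias strings are not handled), it assumes a non-negative `gap`, and it treats an empty window as missing regardless of `min_periods`. `rolling_sketch` is a hypothetical name, and it does not reproduce the internals of `apply_rolling_agg_to_series`.

```python
import math
from statistics import mean


def rolling_sketch(values, agg, window_length=3, gap=1, min_periods=1):
    """Apply `agg` to the `window_length` rows ending `gap` rows before each row."""
    out = []
    for i in range(len(values)):
        end = max(i - gap + 1, 0)            # one past the newest usable row
        start = max(end - window_length, 0)  # oldest row still inside the window
        window = values[start:end]
        if len(window) < max(min_periods, 1):
            out.append(math.nan)
        else:
            out.append(float(agg(window)))
    return out


print(rolling_sketch([4, 3, 2, 1, 0], max))                        # [nan, 4.0, 4.0, 4.0, 3.0]
print(rolling_sketch([4, 3, 2, 1, 0], max, gap=0))                 # [4.0, 4.0, 4.0, 3.0, 2.0]
print(rolling_sketch([4, 3, 2, 1, 0], max, gap=0, min_periods=3))  # [nan, nan, 4.0, 3.0, 2.0]
print(rolling_sketch([4, 3, 2, 1, 0], mean))                       # [nan, 4.0, 3.5, 3.0, 2.0]
```

The first three calls match the integer-valued RollingMax docstring examples, and the last matches the first RollingMean example. The only difference between them is the aggregation passed in.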


================================================
FILE: featuretools/primitives/standard/transform/time_series/rolling_mean.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    apply_rolling_agg_to_series,
)


class RollingMean(TransformPrimitive):
    """Calculates the mean of entries over a given window.

    Description:
        Given a list of numbers and a corresponding list of
        datetimes, return a rolling mean of the numeric values,
        starting at the row `gap` rows away from the current row and looking backward
        over the specified time window (by `window_length` and `gap`).

        Input datetimes should be monotonic.

    Args:
        window_length (int, string, optional): Specifies the amount of data included in each window.
            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,
            for example of one day, the window_length will correspond to a period of time, in this case,
            7 days for a window_length of 7.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time that each window should span.
            The list of available offset aliases can be found at
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
            Defaults to 3.
        gap (int, string, optional): Specifies a gap backwards from each instance before the
            window of usable data begins. If an integer is provided, it will correspond to a number of rows.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time between a target instance and the beginning of its window.
            Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Can only be as large as window_length when window_length is an integer.
            When window_length is an offset alias string, this limitation does not exist, but care should be taken
            to not choose a min_periods that will always be larger than the number of observations in a window.
            Defaults to 0.

    Note:
        Only offset aliases with fixed frequencies can be used when defining gap and window_length.
        This means that aliases such as `M` or `W` cannot be used, as they can indicate different
        numbers of days. ('M', because different months have different numbers of days;
        'W' because a week is anchored to a certain day of the week, like `W-WED`, so it will
        indicate a different number of days depending on the anchoring date.)

    Note:
        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.
        The reverse is not required: an offset alias `window_length` can be combined with a numeric `gap`. In fact,
        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more
        efficient.

    Examples:
        >>> import pandas as pd
        >>> rolling_mean = RollingMean(window_length=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_mean(times, [4, 3, 2, 1, 0]).tolist()
        [nan, 4.0, 3.5, 3.0, 2.0]

        We can also control the gap before the rolling calculation.

        >>> import pandas as pd
        >>> rolling_mean = RollingMean(window_length=3, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_mean(times, [4, 3, 2, 1, 0]).tolist()
        [4.0, 3.5, 3.0, 2.0, 1.0]

        We can also control the minimum number of periods required for the rolling calculation.

        >>> import pandas as pd
        >>> rolling_mean = RollingMean(window_length=3, min_periods=3, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_mean(times, [4, 3, 2, 1, 0]).tolist()
        [nan, nan, 3.0, 2.0, 1.0]
    """

    name = "rolling_mean"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, window_length=3, gap=1, min_periods=0):
        self.window_length = window_length
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def rolling_mean(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime.values)
            return apply_rolling_agg_to_series(
                x,
                np.mean,
                self.window_length,
                self.gap,
                self.min_periods,
            )

        return rolling_mean


================================================
FILE: featuretools/primitives/standard/transform/time_series/rolling_min.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    apply_rolling_agg_to_series,
)


class RollingMin(TransformPrimitive):
    """Determines the minimum of entries over a given window.

    Description:
        Given a list of numbers and a corresponding list of
        datetimes, return a rolling minimum of the numeric values,
        starting at the row `gap` rows away from the current row and looking backward
        over the specified window (by `window_length` and `gap`).
        Input datetimes should be monotonic.

    Args:
        window_length (int, string, optional): Specifies the amount of data included in each window.
            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,
            for example of one day, the window_length will correspond to a period of time, in this case,
            7 days for a window_length of 7.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time that each window should span.
            The list of available offset aliases can be found at
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
            Defaults to 3.
        gap (int, string, optional): Specifies a gap backwards from each instance before the
            window of usable data begins. If an integer is provided, it will correspond to a number of rows.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time between a target instance and the beginning of its window.
            Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Can only be as large as window_length when window_length is an integer.
            When window_length is an offset alias string, this limitation does not exist, but care should be taken
            to not choose a min_periods that will always be larger than the number of observations in a window.
            Defaults to 1.

    Note:
        Only offset aliases with fixed frequencies can be used when defining gap and window_length.
        This means that aliases such as `M` or `W` cannot be used, as they can indicate different
        numbers of days. ('M', because different months have different numbers of days;
        'W' because week will indicate a certain day of the week, like W-Wed, so that will
        indicate a different number of days depending on the anchoring date.)

    Note:
        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.
        This limitation does not exist when using an offset alias to define `window_length`. In fact,
        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more
        efficient.

    Examples:
        >>> import pandas as pd
        >>> rolling_min = RollingMin(window_length=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist()
        [nan, 4.0, 3.0, 2.0, 1.0]

        We can also control the gap before the rolling calculation.

        >>> import pandas as pd
        >>> rolling_min = RollingMin(window_length=3, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist()
        [4.0, 3.0, 2.0, 1.0, 0.0]

        We can also control the minimum number of periods required for the rolling calculation.

        >>> import pandas as pd
        >>> rolling_min = RollingMin(window_length=3, min_periods=3, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist()
        [nan, nan, 2.0, 1.0, 0.0]

        We can also set the window_length and gap using offset alias strings.

        >>> import pandas as pd
        >>> rolling_min = RollingMin(window_length='3min', gap='1min')
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist()
        [nan, 4.0, 3.0, 2.0, 1.0]
    """

    name = "rolling_min"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, window_length=3, gap=1, min_periods=1):
        self.window_length = window_length
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def rolling_min(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime.values)
            return apply_rolling_agg_to_series(
                x,
                lambda series: series.min(),
                self.window_length,
                self.gap,
                self.min_periods,
            )

        return rolling_min
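
# Illustrative sketch, not part of featuretools: with an integer `gap`, the windowing
# described in the docstring above can be approximated in plain pandas by shifting the
# series `gap` rows and then rolling over `window_length` rows. The helper name
# `rolling_min_with_row_gap` is hypothetical and only mirrors the documented example.

```python
import pandas as pd


def rolling_min_with_row_gap(values, times, window_length=3, gap=1, min_periods=1):
    """Hypothetical helper: shift by `gap` rows, then take a row-based rolling minimum."""
    series = pd.Series(values, index=times, dtype="float64")
    if gap > 0:
        # Shifting pushes each value `gap` rows forward, so the current row
        # is excluded from its own window.
        series = series.shift(gap)
    return series.rolling(window_length, min_periods=min_periods).min()


times = pd.date_range(start="2019-01-01", freq="1min", periods=5)
result = rolling_min_with_row_gap([4, 3, 2, 1, 0], times)
```

# The first row has no prior data once the gap is applied, so its value is NaN.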


================================================
FILE: featuretools/primitives/standard/transform/time_series/rolling_outlier_count.py
================================================
import numpy as np
import pandas as pd
from woodwork import init_series
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    apply_rolling_agg_to_series,
)


class RollingOutlierCount(TransformPrimitive):
    """Determines how many values are outliers over a given window.

    Description:
        Given a list of numbers and a corresponding list of
        datetimes, return a rolling count of outliers within the numeric values,
        starting at the row `gap` rows away from the current row and looking backward
        over the specified window (by `window_length` and `gap`). Values are deemed
        outliers using the IQR method, computed over the values in each window.
        Input datetimes should be monotonic.

    Args:
        window_length (int, string, optional): Specifies the amount of data included in each window.
            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling
            frequency, for example of one day, the window_length will correspond to a period of time, in this case,
            7 days for a window_length of 7.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time that each window should span.
            The list of available offset aliases can be found at
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
            Defaults to 3.
        gap (int, string, optional): Specifies a gap backwards from each instance before the
            window of usable data begins. If an integer is provided, it will correspond to a number of rows.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time between a target instance and the beginning of its window.
            Defaults to 1, which excludes the target instance from the window.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Can only be as large as window_length when window_length is an integer.
            When window_length is an offset alias string, this limitation does not exist, but care should be taken
            to not choose a min_periods that will always be larger than the number of observations in a window.
            Defaults to 0.

    Note:
        Only offset aliases with fixed frequencies can be used when defining gap and window_length.
        This means that aliases such as `M` or `W` cannot be used, as they can indicate different
        numbers of days. ('M', because different months have different numbers of days;
        'W' because week will indicate a certain day of the week, like W-Wed, so that will
        indicate a different number of days depending on the anchoring date.)

    Note:
        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.
        This limitation does not exist when using an offset alias to define `window_length`. In fact,
        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more
        efficient.

    Examples:
        >>> import pandas as pd
        >>> rolling_outlier_count = RollingOutlierCount(window_length=4)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6)
        >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist()
        [nan, 0.0, 0.0, 0.0, 0.0, 1.0]

        We can also control the gap before the rolling calculation.

        >>> import pandas as pd
        >>> rolling_outlier_count = RollingOutlierCount(window_length=4, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6)
        >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist()
        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]

        We can also control the minimum number of periods required for the rolling calculation.

        >>> import pandas as pd
        >>> rolling_outlier_count = RollingOutlierCount(window_length=4, min_periods=3)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6)
        >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist()
        [nan, nan, nan, 0.0, 0.0, 1.0]

        We can also set the window_length and gap using offset alias strings.

        >>> import pandas as pd
        >>> rolling_outlier_count = RollingOutlierCount(window_length='4min', gap='1min')
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6)
        >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist()
        [nan, 0.0, 0.0, 0.0, 0.0, 1.0]
    """

    name = "rolling_outlier_count"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, window_length=3, gap=1, min_periods=0):
        self.window_length = window_length
        self.gap = gap
        self.min_periods = min_periods

    def get_outliers_count(self, numeric_series):
        # A window with no non-null values has nothing to evaluate, so return NaN
        if not len(numeric_series.dropna()):
            return np.nan
        # We know the column is numeric, so use the Double logical type in case Woodwork's
        # type inference could not infer a numeric type
        if numeric_series.ww.schema is None:
            numeric_series = init_series(numeric_series, logical_type="Double")
        box_plot_info = numeric_series.ww.box_plot_dict()
        return len(box_plot_info["high_values"]) + len(box_plot_info["low_values"])

    def get_function(self):
        def rolling_outlier_count(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime.values)
            return apply_rolling_agg_to_series(
                series=x,
                agg_func=self.get_outliers_count,
                window_length=self.window_length,
                gap=self.gap,
                min_periods=self.min_periods,
                ignore_window_nans=False,
            )

        return rolling_outlier_count
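
# Illustrative sketch, not part of featuretools: the primitive above delegates outlier
# detection to Woodwork's `box_plot_dict`. The snippet below only approximates the idea
# with plain pandas, ASSUMING the conventional 1.5 * IQR fences; the helper name
# `count_iqr_outliers` and the `whisker` parameter are hypothetical.

```python
import pandas as pd


def count_iqr_outliers(window, whisker=1.5):
    """Hypothetical helper: count values outside [q1 - w*IQR, q3 + w*IQR]."""
    values = pd.Series(window, dtype="float64").dropna()
    if values.empty:
        # Mirror the primitive: an all-null window yields NaN rather than 0.
        return float("nan")
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    low_bound = q1 - whisker * iqr
    high_bound = q3 + whisker * iqr
    return int(((values < low_bound) | (values > high_bound)).sum())


outliers_in_window = count_iqr_outliers([0, 0, 0, 10])
no_outliers = count_iqr_outliers([0, 0, 0, 0])
```

# Under these assumptions the window [0, 0, 0, 10] has an upper fence of 6.25,
# so the value 10 is counted as a single outlier.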


================================================
FILE: featuretools/primitives/standard/transform/time_series/rolling_std.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    apply_rolling_agg_to_series,
)


class RollingSTD(TransformPrimitive):
    """Calculates the standard deviation of entries over a given window.

    Description:
        Given a list of numbers and a corresponding list of
        datetimes, return a rolling standard deviation of
        the numeric values, starting at the row `gap` rows away from the current row and
        looking backward over the specified time window
        (by `window_length` and `gap`). Input datetimes should be monotonic.

    Args:
        window_length (int, string, optional): Specifies the amount of data included in each window.
            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,
            for example of one day, the window_length will correspond to a period of time, in this case,
            7 days for a window_length of 7.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time that each window should span.
            The list of available offset aliases can be found at
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
            Defaults to 3.
        gap (int, string, optional): Specifies a gap backwards from each instance before the
            window of usable data begins. If an integer is provided, it will correspond to a number of rows.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time between a target instance and the beginning of its window.
            Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Can only be as large as window_length when window_length is an integer.
            When window_length is an offset alias string, this limitation does not exist, but care should be taken
            to not choose a min_periods that will always be larger than the number of observations in a window.
            Defaults to 1.

    Note:
        Only offset aliases with fixed frequencies can be used when defining gap and window_length.
        This means that aliases such as `M` or `W` cannot be used, as they can indicate different
        numbers of days. ('M', because different months have different numbers of days;
        'W' because week will indicate a certain day of the week, like W-Wed, so that will
        indicate a different number of days depending on the anchoring date.)

    Note:
        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.
        This limitation does not exist when using an offset alias to define `window_length`. In fact,
        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more
        efficient.

    Examples:
        >>> import pandas as pd
        >>> rolling_std = RollingSTD(window_length=4)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist()
        [nan, nan, 0.7071067811865476, 1.0, 1.2909944487358056]

        We can also control the gap before the rolling calculation.

        >>> import pandas as pd
        >>> rolling_std = RollingSTD(window_length=4, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist()
        [nan, 0.7071067811865476, 1.0, 1.2909944487358056, 1.2909944487358056]

        We can also control the minimum number of periods required for the rolling calculation.

        >>> import pandas as pd
        >>> rolling_std = RollingSTD(window_length=4, min_periods=4, gap=0)
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist()
        [nan, nan, nan, 1.2909944487358056, 1.2909944487358056]

        We can also set the window_length and gap using offset alias strings.

        >>> import pandas as pd
        >>> rolling_std = RollingSTD(window_length='4min', gap='1min')
        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)
        >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist()
        [nan, nan, 0.7071067811865476, 1.0, 1.2909944487358056]
    """

    name = "rolling_std"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, window_length=3, gap=1, min_periods=1):
        self.window_length = window_length
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def rolling_std(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime.values)
            return apply_rolling_agg_to_series(
                x,
                lambda series: series.std(),
                self.window_length,
                self.gap,
                self.min_periods,
            )

        return rolling_std
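
# Illustrative sketch, not part of featuretools: the primitive above calls
# `Series.std()`, which in pandas is the sample standard deviation (ddof=1) by default.
# That is why a window holding a single value produces NaN in the docstring examples.
# The variable names below are only for illustration.

```python
import pandas as pd

two_values = pd.Series([4.0, 3.0])
one_value = pd.Series([4.0])

# Sample standard deviation (pandas default, ddof=1).
sample_std = two_values.std()
# Population standard deviation, shown only for contrast.
population_std = two_values.std(ddof=0)
# A single observation has no sample standard deviation.
single_std = one_value.std()
```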


================================================
FILE: featuretools/primitives/standard/transform/time_series/rolling_trend.py
================================================
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Double

from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.standard.transform.time_series.utils import (
    apply_rolling_agg_to_series,
)
from featuretools.utils import calculate_trend


class RollingTrend(TransformPrimitive):
    """Calculates the trend of a given window of entries of a column over time.

    Description:
        Given a list of numbers and a corresponding list of
        datetimes, return a rolling slope of the linear trend
        of values, starting at the row `gap` rows away from the current row and looking backward
        over the specified time window (by `window_length` and `gap`).

        Input datetimes should be monotonic.

    Args:
        window_length (int, string, optional): Specifies the amount of data included in each window.
            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,
            for example of one day, the window_length will correspond to a period of time, in this case,
            7 days for a window_length of 7.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time that each window should span.
            The list of available offset aliases can be found at
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.
            Defaults to 3.
        gap (int, string, optional): Specifies a gap backwards from each instance before the
            window of usable data begins. If an integer is provided, it will correspond to a number of rows.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time between a target instance and the beginning of its window.
            Defaults to 1.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Can only be as large as window_length when window_length is an integer.
            When window_length is an offset alias string, this limitation does not exist, but care should be taken
            to not choose a min_periods that will always be larger than the number of observations in a window.
            Defaults to 0.

    Examples:
        >>> import pandas as pd
        >>> rolling_trend = RollingTrend()
        >>> times = pd.date_range(start="2019-01-01", freq="1D", periods=10)
        >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist()
        [nan, nan, nan, 1.4999999999999998, 2.9999999999999996, 5.999999999999999, 7.999999999999999, 16.0, 36.0, 72.0]

        We can also control the gap before the rolling calculation.

        >>> rolling_trend = RollingTrend(gap=0)
        >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist()
        [nan, nan, 1.4999999999999998, 2.9999999999999996, 5.999999999999999, 7.999999999999999, 16.0, 36.0, 72.0, 144.0]

        We can also control the minimum number of periods required for the rolling calculation.

        >>> rolling_trend = RollingTrend(window_length=4, min_periods=4, gap=0)
        >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist()
        [nan, nan, nan, 2.299999999999999, 4.599999999999998, 6.799999999999996, 12.799999999999992, 26.399999999999984, 55.19999999999997, 110.39999999999993]

        We can also set the window_length and gap using offset alias strings.

        >>> rolling_trend = RollingTrend(window_length="4D", gap="1D")
        >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist()
        [nan, nan, nan, 1.4999999999999998, 2.299999999999999, 4.599999999999998, 6.799999999999996, 12.799999999999992, 26.399999999999984, 55.19999999999997]
    """

    name = "rolling_trend"
    input_types = [
        ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ColumnSchema(semantic_tags={"numeric"}),
    ]
    return_type = ColumnSchema(logical_type=Double, semantic_tags={"numeric"})
    uses_full_dataframe = True

    def __init__(self, window_length=3, gap=1, min_periods=0):
        self.window_length = window_length
        self.gap = gap
        self.min_periods = min_periods

    def get_function(self):
        def rolling_trend(datetime, numeric):
            x = pd.Series(numeric.values, index=datetime.values)
            return apply_rolling_agg_to_series(
                x,
                calculate_trend,
                self.window_length,
                self.gap,
                self.min_periods,
            )

        return rolling_trend
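
# Illustrative sketch, not part of featuretools: `calculate_trend` is defined elsewhere
# in the library and is not shown here. The snippet below ASSUMES the trend is a
# least-squares slope over evenly spaced points and that fewer than three points yield
# NaN, which is consistent with the docstring output above. The helper name
# `slope_per_step` is hypothetical.

```python
import numpy as np


def slope_per_step(values):
    """Hypothetical helper: least-squares slope against positions 0, 1, 2, ..."""
    y = np.asarray(values, dtype="float64")
    if y.size <= 2:
        # Assumption: too few points to report a trend.
        return float("nan")
    x = np.arange(y.size, dtype="float64")
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)


first_full_window = slope_per_step([1, 2, 4])
next_window = slope_per_step([2, 4, 8])
```

# With daily data, a slope per step is a slope per day, matching the first
# non-NaN entries (about 1.5 and 3.0) in the default `RollingTrend()` example.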


================================================
FILE: featuretools/primitives/standard/transform/time_series/utils.py
================================================
from typing import Callable, Optional, Union

import numpy as np
import pandas as pd
from pandas import Series
from pandas.core.window.rolling import Rolling
from pandas.tseries.frequencies import to_offset


def roll_series_with_gap(
    series: Series,
    window_length: Union[int, str],
    gap: Union[int, str],
    min_periods: int,
) -> Rolling:
    """Provide rolling window calculations where the windows are determined using both a gap parameter
    that indicates the amount of time between each instance and its window and a window length parameter
    that determines the amount of data in each window.

    Args:
        series (Series): The series over which rolling windows will be created. The series must have numeric values and a DatetimeIndex.
        window_length (int, string): Specifies the amount of data included in each window.
            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,
            for example of one day, the window_length will correspond to a period of time, in this case,
            7 days for a window_length of 7.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time that each window should span.
            The list of available offset aliases can be found at
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
        gap (int, string): Specifies a gap backwards from each instance before the
            window of usable data begins. If an integer is provided, it will correspond to a number of rows.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time between a target instance and the beginning of its window.
            A gap of 0 will include the target instance in the window.
        min_periods (int): Minimum number of observations required for performing calculations
            over the window. Can only be as large as window_length when window_length is an integer.
            When window_length is an offset alias string, this limitation does not exist, but care should be taken
            to not choose a min_periods that will always be larger than the number of observations in a window.

    Returns:
        pandas.core.window.rolling.Rolling: The Rolling object for the series passed in.
    """
    _check_window_length(window_length)
    _check_gap(window_length, gap)

    functional_window_length = window_length
    if isinstance(gap, str):
        # Add the window_length and gap so that the rolling operation correctly takes gap into account.
        # That way, we can later remove the gap rows in order to apply the primitive function
        # to the correct window
        functional_window_length = to_offset(window_length) + to_offset(gap)
    elif gap > 0:
        # When gap is numeric, we can apply a shift to incorporate gap right now
        # since the gap will be the same number of rows for the whole dataset
        series = series.shift(gap)

    return series.rolling(functional_window_length, min_periods)
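
# Illustrative sketch, not part of featuretools: when `gap` is an offset alias, the
# function above rolls over `window_length + gap` so the gap rows can be trimmed later.
# The snippet shows that two fixed-frequency offsets add up to a single offset and that
# pandas accepts the result as a rolling window. Variable names are illustrative.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

times = pd.date_range(start="2019-01-01", freq="1min", periods=5)
series = pd.Series([4.0, 3.0, 2.0, 1.0, 0.0], index=times)

# A 3-minute window plus a 1-minute gap becomes one 4-minute rolling span.
combined = to_offset("3min") + to_offset("1min")

# Number of rows that fall inside each 4-minute span (right-closed by default).
sizes = series.rolling(combined, min_periods=1).count()
```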


def _get_rolled_series_without_gap(window: Series, gap_offset: str) -> Series:
    """Applies the gap offset_string to the rolled window, returning a window
    that is the correct length of time away from the original instance.

    Args:
        window (Series): A rolling window that includes both the window length and gap spans of time.
        gap_offset (string): The pandas offset alias that determines how much time at the end of the window
            should be removed.

    Returns:
        Series: The window with gap rows removed
    """
    if not len(window):
        return window

    window_start_date = window.index[0]
    window_end_date = window.index[-1]

    gap_bound = window_end_date - to_offset(gap_offset)

    # If the gap is larger than the series, no rows are left in the window
    if gap_bound < window_start_date:
        return Series(dtype="float64")

    # Only return the rows that are within the offset's bounds
    return window[window.index <= gap_bound]
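
# Illustrative sketch, not part of featuretools: a standalone restatement of the
# trimming step above, applied to a small window so the effect of the gap is visible.
# The helper name `trim_gap` is hypothetical.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def trim_gap(window, gap_offset):
    """Hypothetical helper: drop rows closer than `gap_offset` to the window's last row."""
    if not len(window):
        return window
    gap_bound = window.index[-1] - to_offset(gap_offset)
    if gap_bound < window.index[0]:
        # The gap covers the whole window, so nothing is left.
        return pd.Series(dtype="float64")
    return window[window.index <= gap_bound]


times = pd.date_range(start="2019-01-01", freq="1min", periods=5)
window = pd.Series([4.0, 3.0, 2.0, 1.0, 0.0], index=times)

trimmed = trim_gap(window, "1min")   # drops only the most recent row
emptied = trim_gap(window, "10min")  # gap is longer than the window
```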


def apply_roll_with_offset_gap(
    window: Series,
    gap_offset: str,
    reducer_fn: Callable[[Series], float],
    min_periods: int,
) -> float:
    """Takes in a series to which an offset gap will be applied, removing however many
    rows fall under the gap before applying the reducing function.

    Args:
        window (Series):  A rolling window that includes both the window length and gap spans of time.
        gap_offset (string): The pandas offset alias that determines how much time at the end of the window
            should be removed.
        reducer_fn (callable[Series -> float]): The function to be applied to the window in order to produce
            the aggregate that will be included in the resulting feature.
        min_periods (int): Minimum number of observations required for performing calculations
            over the window.

    Returns:
        float: The aggregate value to be used as a feature value.
    """
    window = _get_rolled_series_without_gap(window, gap_offset)

    if min_periods is None:
        min_periods = 1

    if len(window) < min_periods or not len(window):
        return np.nan

    return reducer_fn(window)


def _check_window_length(window_length: Union[int, str]) -> None:
    # Window length must either be a valid offset alias
    if isinstance(window_length, str):
        try:
            to_offset(window_length)
        except ValueError:
            raise ValueError(
                f"Cannot roll series. The specified window length, {window_length}, is not a valid offset alias.",
            )
    # Or an integer greater than zero
    elif isinstance(window_length, int):
        if window_length <= 0:
            raise ValueError("Window length must be greater than zero.")
    else:
        raise TypeError("Window length must be either an offset string or an integer.")


def _check_gap(window_length: Union[int, str], gap: Union[int, str]) -> None:
    # Gap must either be a valid offset string that also has an offset string window length
    if isinstance(gap, str):
        if not isinstance(window_length, str):
            raise TypeError(
                f"Cannot roll series with offset gap, {gap}, and numeric window length, {window_length}. "
                "If an offset alias is used for gap, the window length must also be defined as an offset alias. "
                "Please either change gap to be numeric or change window length to be an offset alias.",
            )
        try:
            to_offset(gap)
        except ValueError:
            raise ValueError(
                f"Cannot roll series. The specified gap, {gap}, is not a valid offset alias.",
            )
    # Or an integer greater than or equal to zero
    elif isinstance(gap, int):
        if gap < 0:
            raise ValueError("Gap must be greater than or equal to zero.")
    else:
        raise TypeError("Gap must be either an offset string or an integer.")


def apply_rolling_agg_to_series(
    series: Series,
    agg_func: Callable[[Series], float],
    window_length: Union[int, str],
    gap: Union[int, str] = 0,
    min_periods: int = 1,
    ignore_window_nans: bool = False,
) -> np.ndarray:
    """Applies a given aggregation function to a rolled series.

    Args:
        series (Series): The series over which rolling windows will be created. The series must have numeric values and a DatetimeIndex.
        agg_func (callable[Series -> float]): The aggregation function to apply to a rolled series.
        window_length (int, string): Specifies the amount of data included in each window.
            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,
            for example of one day, the window_length will correspond to a period of time, in this case,
            7 days for a window_length of 7.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time that each window should span.
            The list of available offset aliases can be found at
            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
        gap (int, string, optional): Specifies a gap backwards from each instance before the
            window of usable data begins. If an integer is provided, it will correspond to a number of rows.
            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),
            and it will indicate a length of time between a target instance and the beginning of its window.
            Defaults to 0, which will include the target instance in the window.
        min_periods (int, optional): Minimum number of observations required for performing calculations
            over the window. Can only be as large as window_length when window_length is an integer.
            When window_length is an offset alias string, this limitation does not exist, but care should be taken
            to not choose a min_periods that will always be larger than the number of observations in a window.
            Defaults to 1.
        ignore_window_nans (bool, optional): Whether or not NaNs in the rolling window should be included in the rolling calculation.
            NaNs by default get counted towards min_periods. When set to True,
            all partial values calculated by `agg_func` in the rolling window get replaced with NaN.
            Defaults to False.

    Returns:
        numpy.ndarray: The array of rolling calculated values.

    Note:
        Certain operations, like `pandas.core.window.rolling.Rolling.count`, that can be performed
        on the Rolling object returned here may treat NaNs as periods to include in window calculations.
        So a window [NaN, 1, 3] when `min_periods=3` will proceed with count, saying there are three periods
        but only two values, and would return count=2. The calculation `max`, on the other hand,
        would not recognize NaN as a valid period, and would therefore return `max=NaN` as the window has
        fewer valid periods (two, in this case) than `min_periods` (three, in this case).
        Most rolling calculations act this way. The implication of that here is that in order to
        achieve the gap, we insert NaNs at the beginning of the series, which would cause `count` to calculate
        on windows that technically should not have the correct number of periods. Any primitive that uses this function
        should determine whether `ignore_window_nans` should be set to `True`.

    Note:
        Only offset aliases with fixed frequencies can be used when defining gap and window_length.
        This means that aliases such as `M` or `W` cannot be used, as they can indicate different
        numbers of days. ('M', because different months have different numbers of days;
        'W' because week will indicate a certain day of the week, like W-Wed, so that will
        indicate a different number of days depending on the anchoring date.)

    Note:
        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.
        This limitation does not exist when using an offset alias to define `window_length`. In fact,
        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more
        efficient."""
    rolled_series = roll_series_with_gap(series, window_length, gap, min_periods)
    if isinstance(gap, str):
        additional_args = (gap, agg_func, min_periods)
        return rolled_series.apply(
            apply_roll_with_offset_gap,
            args=additional_args,
        ).values
    applied_rolled_series = rolled_series.apply(agg_func)

    if ignore_window_nans:
        if not min_periods:
            # when min periods is 0 or None it's treated the same as if it's 1
            num_nans = gap
        else:
            num_nans = min_periods - 1 + gap
        applied_rolled_series.iloc[range(num_nans)] = np.nan
    return applied_rolled_series.values


def _apply_gap_for_expanding_primitives(
    x: Union[Series, pd.Index],
    gap: Union[int, str],
) -> Optional[Series]:
    if not isinstance(gap, int):
        raise TypeError(
            "String offsets are not supported for the gap parameter in Expanding primitives",
        )
    if isinstance(x, pd.Index):
        return x.to_series().shift(gap)
    return x.shift(gap)


================================================
FILE: featuretools/primitives/standard/transform/url/__init__.py
================================================
from featuretools.primitives.standard.transform.url.url_to_domain import URLToDomain
from featuretools.primitives.standard.transform.url.url_to_protocol import URLToProtocol
from featuretools.primitives.standard.transform.url.url_to_tld import URLToTLD


================================================
FILE: featuretools/primitives/standard/transform/url/url_to_domain.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import URL, Categorical

from featuretools.primitives.base import TransformPrimitive


class URLToDomain(TransformPrimitive):
    """Determines the domain of a url.

    Description:
        Calculates the label to identify the network domain of a URL. Supports
        urls with or without protocol as well as international country domains.

    Examples:
        >>> url_to_domain = URLToDomain()
        >>> urls =  ['https://play.google.com',
        ...          'http://www.google.co.in',
        ...          'www.facebook.com']
        >>> url_to_domain(urls).tolist()
        ['play.google.com', 'google.co.in', 'facebook.com']
    """

    name = "url_to_domain"
    input_types = [ColumnSchema(logical_type=URL)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        def url_to_domain(x):
            p = r"^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)"
            return x.str.extract(p, expand=False)

        return url_to_domain


================================================
FILE: featuretools/primitives/standard/transform/url/url_to_protocol.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import URL, Categorical

from featuretools.primitives.base import TransformPrimitive


class URLToProtocol(TransformPrimitive):
    """Determines the protocol (http or https) of a url.

    Description:
        Extract the protocol of a url using regex.
        It will be either https or http. Returns nan if
        the url doesn't contain a protocol.

    Examples:
        >>> url_to_protocol = URLToProtocol()
        >>> urls =  ['https://play.google.com',
        ...          'http://www.google.co.in',
        ...          'www.facebook.com']
        >>> url_to_protocol(urls).to_list()
        ['https', 'http', nan]
    """

    name = "url_to_protocol"
    input_types = [ColumnSchema(logical_type=URL)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        def url_to_protocol(x):
            p = r"^(https|http)(?:\:)"
            return x.str.extract(p, expand=False)

        return url_to_protocol


================================================
FILE: featuretools/primitives/standard/transform/url/url_to_tld.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import URL, Categorical

from featuretools.primitives.base import TransformPrimitive
from featuretools.utils.common_tld_utils import COMMON_TLDS


class URLToTLD(TransformPrimitive):
    """Determines the top level domain of a url.

    Description:
        Extract the top level domain of a url, using regex,
        and a list of common top level domains. Returns nan if
        the url is invalid or null.
        Common top level domains were pulled from this list:
        https://www.hayksaakian.com/most-popular-tlds/

    Examples:
        >>> url_to_tld = URLToTLD()
        >>> urls = ['https://www.google.com', 'http://www.google.co.in',
        ...         'www.facebook.com']
        >>> url_to_tld(urls).to_list()
        ['com', 'in', 'com']
    """

    name = "url_to_tld"
    input_types = [ColumnSchema(logical_type=URL)]
    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={"category"})

    def get_function(self):
        self.tlds_pattern = r"(?:\.({}))".format("|".join(COMMON_TLDS))

        def url_to_domain(x):
            p = r"^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)"
            return x.str.extract(p, expand=False)

        def url_to_tld(x):
            domains = url_to_domain(x)
            df = domains.str.extractall(self.tlds_pattern)
            matches = df.groupby(level=0).last()[0]
            return matches.reindex(x.index)

        return url_to_tld


================================================
FILE: featuretools/primitives/utils.py
================================================
import importlib.util
import os
from inspect import getfullargspec, getsource, isclass
from typing import Dict, List

import pandas as pd
from woodwork import list_logical_types, list_semantic_tags, type_system
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import NaturalLanguage

import featuretools
from featuretools.primitives import NumberOfCommonWords
from featuretools.primitives.base import (
    AggregationPrimitive,
    PrimitiveBase,
    TransformPrimitive,
)
from featuretools.utils.gen_utils import find_descendents


def _get_primitives(primitive_kind):
    """Helper function that selects all named primitives
    that are subclasses of `primitive_kind`
    """
    primitives = set()
    for attribute_string in dir(featuretools.primitives):
        attribute = getattr(featuretools.primitives, attribute_string)
        if isclass(attribute):
            if issubclass(attribute, primitive_kind) and attribute.name:
                primitives.add(attribute)
    return {prim.name.lower(): prim for prim in primitives}


def get_aggregation_primitives():
    """Returns all aggregation primitives"""
    return _get_primitives(featuretools.primitives.AggregationPrimitive)


def get_transform_primitives():
    """Returns all transform primitives"""
    return _get_primitives(featuretools.primitives.TransformPrimitive)


def get_all_primitives():
    """Helper function to return all primitives"""
    primitives = set()
    for attribute_string in dir(featuretools.primitives):
        attribute = getattr(featuretools.primitives, attribute_string)
        if isclass(attribute):
            if issubclass(attribute, PrimitiveBase) and attribute.name:
                primitives.add(attribute)
    return {prim.__name__: prim for prim in primitives}


def _get_natural_language_primitives():
    """Returns all Natural Language transform primitives"""
    transform_primitives = get_transform_primitives()

    def _natural_language_in_input_type(primitive):
        for input_type in primitive.input_types:
            if isinstance(input_type, list):
                if any(
                    isinstance(column_schema.logical_type, NaturalLanguage)
                    for column_schema in input_type
                ):
                    return True
            else:
                if isinstance(input_type.logical_type, NaturalLanguage):
                    return True
        return False

    return {
        name: primitive
        for name, primitive in transform_primitives.items()
        if _natural_language_in_input_type(primitive)
    }


def list_primitives():
    """Returns a DataFrame that lists and describes each built-in primitive."""
    trans_names, trans_primitives, valid_inputs, return_type = _get_names_primitives(
        get_transform_primitives,
    )
    transform_df = pd.DataFrame(
        {
            "name": trans_names,
            "description": _get_descriptions(trans_primitives),
            "valid_inputs": valid_inputs,
            "return_type": return_type,
        },
    )
    transform_df["type"] = "transform"

    agg_names, agg_primitives, valid_inputs, return_type = _get_names_primitives(
        get_aggregation_primitives,
    )
    agg_df = pd.DataFrame(
        {
            "name": agg_names,
            "description": _get_descriptions(agg_primitives),
            "valid_inputs": valid_inputs,
            "return_type": return_type,
        },
    )
    agg_df["type"] = "aggregation"

    columns = [
        "name",
        "type",
        "description",
        "valid_inputs",
        "return_type",
    ]
    return pd.concat([agg_df, transform_df], ignore_index=True)[columns]


def summarize_primitives() -> pd.DataFrame:
    """Returns a metrics summary DataFrame of all primitives found in list_primitives."""
    (
        trans_names,
        trans_primitives,
        trans_valid_inputs,
        trans_return_type,
    ) = _get_names_primitives(get_transform_primitives)

    (
        agg_names,
        agg_primitives,
        agg_valid_inputs,
        agg_return_type,
    ) = _get_names_primitives(get_aggregation_primitives)

    tot_trans = len(trans_names)
    tot_agg = len(agg_names)
    tot_prims = tot_trans + tot_agg
    all_primitives = trans_primitives + agg_primitives
    primitives_summary = _get_summary_primitives(all_primitives)
    summary_dict = {
        "total_primitives": tot_prims,
        "aggregation_primitives": tot_agg,
        "transform_primitives": tot_trans,
        **primitives_summary["general_metrics"],
    }
    summary_dict.update(
        {
            f"uses_{ltype}_input": count
            for ltype, count in primitives_summary["logical_type_input_metrics"].items()
        },
    )
    summary_dict.update(
        {
            f"uses_{tag}_tag_input": count
            for tag, count in primitives_summary["semantic_tag_metrics"].items()
        },
    )
    summary_df = pd.DataFrame(
        [{"Metric": k, "Count": v} for k, v in summary_dict.items()],
    )
    return summary_df


def get_default_aggregation_primitives():
    agg_primitives = [
        featuretools.primitives.Sum,
        featuretools.primitives.Std,
        featuretools.primitives.Max,
        featuretools.primitives.Skew,
        featuretools.primitives.Min,
        featuretools.primitives.Mean,
        featuretools.primitives.Count,
        featuretools.primitives.PercentTrue,
        featuretools.primitives.NumUnique,
        featuretools.primitives.Mode,
    ]
    return agg_primitives


def get_default_transform_primitives():
    # featuretools.primitives.TimeSince
    trans_primitives = [
        featuretools.primitives.Age,
        featuretools.primitives.Day,
        featuretools.primitives.Year,
        featuretools.primitives.Month,
        featuretools.primitives.Weekday,
        featuretools.primitives.Haversine,
        featuretools.primitives.NumWords,
        featuretools.primitives.NumCharacters,
    ]
    return trans_primitives


def _get_descriptions(primitives):
    descriptions = []
    for prim in primitives:
        description = ""
        if prim.__doc__ is not None:
            # Break on the empty line between the docstring description and the remainder of the docstring
            description = prim.__doc__.split("\n\n")[0]
            # remove any excess whitespace from line breaks
            description = " ".join(description.split())
        descriptions.append(description)
    return descriptions


def _get_summary_primitives(primitives: List) -> Dict[str, Dict[str, int]]:
    """Provides metrics for a list of primitives."""
    unique_input_types = set()
    unique_output_types = set()
    uses_multi_input = 0
    uses_multi_output = 0
    uses_external_data = 0
    are_controllable = 0
    logical_type_metrics = {
        log_type: 0 for log_type in list(list_logical_types()["type_string"])
    }
    semantic_tag_metrics = {
        sem_tag: 0 for sem_tag in list(list_semantic_tags()["name"])
    }
    semantic_tag_metrics.update(
        {"foreign_key": 0},
    )  # not currently in list_semantic_tags()

    for prim in primitives:
        log_in_type_checks = set()
        sem_tag_type_checks = set()
        input_types = prim.flatten_nested_input_types(prim.input_types)
        _check_input_types(
            input_types,
            log_in_type_checks,
            sem_tag_type_checks,
            unique_input_types,
        )
        for ltype in list(log_in_type_checks):
            logical_type_metrics[ltype] += 1

        for sem_tag in list(sem_tag_type_checks):
            semantic_tag_metrics[sem_tag] += 1

        if len(prim.input_types) > 1:
            uses_multi_input += 1

        # checks if number_output_features is set as an instance variable or set as a constant
        if (
            "self.number_output_features =" in getsource(prim.__init__)
            or prim.number_output_features > 1
        ):
            uses_multi_output += 1
        unique_output_types.add(str(prim.return_type))

        if hasattr(prim, "filename"):
            uses_external_data += 1

        if len(getfullargspec(prim.__init__).args) > 1:
            are_controllable += 1

    return {
        "general_metrics": {
            "unique_input_types": len(unique_input_types),
            "unique_output_types": len(unique_output_types),
            "uses_multi_input": uses_multi_input,
            "uses_multi_output": uses_multi_output,
            "uses_external_data": uses_external_data,
            "are_controllable": are_controllable,
        },
        "logical_type_input_metrics": logical_type_metrics,
        "semantic_tag_metrics": semantic_tag_metrics,
    }


def _check_input_types(
    input_types: List[ColumnSchema],
    log_in_type_checks: set,
    sem_tag_type_checks: set,
    unique_input_types: set,
):
    """Checks if any logical types or semantic tags occur in a list of Woodwork input types and keeps track of unique input types."""
    for in_type in input_types:
        if in_type.semantic_tags:
            for sem_tag in in_type.semantic_tags:
                sem_tag_type_checks.add(sem_tag)
        if in_type.logical_type:
            log_in_type_checks.add(in_type.logical_type.type_string)
        unique_input_types.add(str(in_type))


def _get_names_primitives(primitive_func):
    names = []
    primitives = []
    valid_inputs = []
    return_type = []
    for name, primitive in primitive_func().items():
        names.append(name)
        primitives.append(primitive)
        input_types = _get_unique_input_types(primitive.input_types)
        valid_inputs.append(", ".join(input_types))
        return_type.append(
            str(primitive.return_type) if primitive.return_type is not None else None,
        )
    return names, primitives, valid_inputs, return_type


def _get_unique_input_types(input_types):
    types = set()
    for input_type in input_types:
        if isinstance(input_type, list):
            types |= _get_unique_input_types(input_type)
        else:
            types.add(str(input_type))
    return types


def list_primitive_files(directory):
    """Returns a list of files in the directory that might contain primitives."""
    files = os.listdir(directory)
    keep = []
    for path in files:
        if not check_valid_primitive_path(path):
            continue
        keep.append(os.path.join(directory, path))
    return keep


def check_valid_primitive_path(path):
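    """Returns True if path points to a Python file that may define a primitive.

    Directories, dunder files, hidden files, and non-``.py`` files are rejected.

    Examples:
        A minimal sketch with hypothetical file names, assuming none of them
        is a directory in the current working directory.

        >>> check_valid_primitive_path("custom_max.py")
        True
        >>> check_valid_primitive_path("__init__.py")
        False
        >>> check_valid_primitive_path(".hidden.py")
        False
        >>> check_valid_primitive_path("notes.txt")
        False
    """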
    if os.path.isdir(path):
        return False

    filename = os.path.basename(path)

    if filename[:2] == "__" or filename[0] == "." or filename[-3:] != ".py":
        return False

    return True


def load_primitive_from_file(filepath):
    """Loads the single primitive class defined in a file and returns it as a (name, class) tuple."""
    module = os.path.basename(filepath)[:-3]
    # The first argument is the name assigned to the module being loaded
    spec = importlib.util.spec_from_file_location(module, filepath)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    primitives = []
    for primitive_name in vars(module):
        primitive_class = getattr(module, primitive_name)
        if (
            isclass(primitive_class)
            and issubclass(primitive_class, PrimitiveBase)
            and primitive_class not in (AggregationPrimitive, TransformPrimitive)
        ):
            primitives.append((primitive_name, primitive_class))

    if len(primitives) == 0:
        raise RuntimeError("No primitive defined in file %s" % filepath)
    elif len(primitives) > 1:
        raise RuntimeError("More than one primitive defined in file %s" % filepath)

    return primitives[0]


def serialize_primitive(primitive: PrimitiveBase):
    """Builds a dictionary with the data necessary to construct the given primitive."""
    args_dict = {name: val for name, val in primitive.get_arguments()}
    cls = type(primitive)
    if cls == NumberOfCommonWords and "word_set" in args_dict:
        args_dict["word_set"] = list(args_dict["word_set"])
    return {
        "type": cls.__name__,
        "module": cls.__module__,
        "arguments": args_dict,
    }


class PrimitivesDeserializer(object):
    """
    This class wraps a cache and a generator that iterates over all primitive
    classes. When deserializing a primitive, if it is not in the cache, we
    iterate until it is found, adding every seen class to the cache. When
    deserializing the next primitive, the iteration resumes where it left off. This
    means that we never visit a class more than once.
    """

    def __init__(self):
        # Cache to avoid repeatedly searching for primitive class
        # (class_name, module_name) -> class
        self.class_cache = {}

        self.primitive_classes = find_descendents(PrimitiveBase)

    def deserialize_primitive(self, primitive_dict):
        """
        Construct a primitive from the given dictionary (output from
        serialize_primitive).
        """
        class_name = primitive_dict["type"]
        module_name = primitive_dict["module"]
        class_cache_key = (class_name, module_name.split(".")[0])

        if class_cache_key in self.class_cache:
            cls = self.class_cache[class_cache_key]
        else:
            cls = self._find_class_in_descendants(class_cache_key)

        if not cls:
            raise RuntimeError(
                'Primitive "%s" in module "%s" not found' % (class_name, module_name),
            )
        arguments = primitive_dict["arguments"]
        if cls == NumberOfCommonWords and "word_set" in arguments:
            # We converted word_set from a set to a list to make it serializable,
            # we should convert it back now.
            arguments["word_set"] = set(arguments["word_set"])
        primitive_instance = cls(**arguments)

        return primitive_instance

    def _find_class_in_descendants(self, search_key):
        for cls in self.primitive_classes:
            cls_key = (cls.__name__, cls.__module__.split(".")[0])
            self.class_cache[cls_key] = cls

            if cls_key == search_key:
                return cls


def get_all_logical_type_names():
    """Helper function that returns all registered woodwork logical types"""
    return {lt.__name__: lt for lt in type_system.registered_types}


================================================
FILE: featuretools/selection/__init__.py
================================================
# flake8: noqa
from featuretools.selection.api import *


================================================
FILE: featuretools/selection/api.py
================================================
# flake8: noqa
from featuretools.selection.selection import *


================================================
FILE: featuretools/selection/selection.py
================================================
import pandas as pd
from woodwork.logical_types import Boolean, BooleanNullable


def remove_low_information_features(feature_matrix, features=None):
    """Select features that have at least 2 unique values and that are not all null

    Args:
        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances
        features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select

    Returns:
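        pd.DataFrame, list[:class:`.FeatureBase`]:
            The feature matrix and the list of selected features.
            If no feature list is provided as input, only the feature matrix is returned.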
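
    Examples:
        A minimal sketch using a small hand-built DataFrame (illustrative data,
        not the output of ``dfs``).

        >>> import pandas as pd
        >>> fm = pd.DataFrame({"a": [1, 2, 3], "b": [1, 1, 1], "c": [None, None, None]})
        >>> remove_low_information_features(fm).columns.tolist()
        ['a']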

    """
    keep = [
        c
        for c in feature_matrix
        if (
            feature_matrix[c].nunique(dropna=False) > 1
            and feature_matrix[c].dropna().shape[0] > 0
        )
    ]
    feature_matrix = feature_matrix[keep]
    if features is not None:
        features = [f for f in features if f.get_name() in feature_matrix.columns]
        return feature_matrix, features
    return feature_matrix


def remove_highly_null_features(feature_matrix, features=None, pct_null_threshold=0.95):
    """
    Removes columns from a feature matrix that have higher than a set threshold
    of null values.

    Args:
        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances.
        features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select.
        pct_null_threshold (float): Fraction of null values, between 0 and 1, at or above which a feature
                is considered highly null and removed. With a threshold of 0, any feature containing
                a null value is removed. Defaults to 0.95.

    Returns:
        pd.DataFrame, list[:class:`.FeatureBase`]:
            The feature matrix and the list of generated feature definitions. Matches dfs output.
            If no feature list is provided as input, the feature list will not be returned.
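
    Examples:
        A minimal sketch using a small hand-built DataFrame (illustrative data,
        not the output of ``dfs``). Column ``b`` is 75% null.

        >>> import pandas as pd
        >>> fm = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [None, None, None, 1.0]})
        >>> remove_highly_null_features(fm).columns.tolist()
        ['a', 'b']
        >>> remove_highly_null_features(fm, pct_null_threshold=0.5).columns.tolist()
        ['a']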
    """
    if pct_null_threshold < 0 or pct_null_threshold > 1:
        raise ValueError(
            "pct_null_threshold must be a float between 0 and 1, inclusive.",
        )

    percent_null_by_col = (feature_matrix.isnull().mean()).to_dict()

    if pct_null_threshold == 0.0:
        keep = [
            f_name
            for f_name, pct_null in percent_null_by_col.items()
            if pct_null <= pct_null_threshold
        ]
    else:
        keep = [
            f_name
            for f_name, pct_null in percent_null_by_col.items()
            if pct_null < pct_null_threshold
        ]

    return _apply_feature_selection(keep, feature_matrix, features)


def remove_single_value_features(
    feature_matrix,
    features=None,
    count_nan_as_value=False,
):
    """Removes columns in feature matrix where all the values are the same.

    Args:
        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances.
        features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select.
        count_nan_as_value (bool): If True, missing values will be counted as their own unique value.
                    If set to False, a feature that has one unique value and all other
                    data missing will be removed from the feature matrix. Defaults to False.

    Returns:
        pd.DataFrame, list[:class:`.FeatureBase`]:
            The feature matrix and the list of generated feature definitions.
            Matches dfs output.
            If no feature list is provided as input, the feature list will not be returned.
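
    Examples:
        A minimal sketch using a small hand-built DataFrame (illustrative data,
        not the output of ``dfs``). Column ``b`` holds one distinct value plus a missing value.

        >>> import pandas as pd
        >>> fm = pd.DataFrame({"a": [1, 2, 3], "b": [7.0, 7.0, None]})
        >>> remove_single_value_features(fm).columns.tolist()
        ['a']
        >>> remove_single_value_features(fm, count_nan_as_value=True).columns.tolist()
        ['a', 'b']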
    """
    unique_counts_by_col = feature_matrix.nunique(
        dropna=not count_nan_as_value,
    ).to_dict()

    keep = [
        f_name
        for f_name, unique_count in unique_counts_by_col.items()
        if unique_count > 1
    ]
    return _apply_feature_selection(keep, feature_matrix, features)


def remove_highly_correlated_features(
    feature_matrix,
    features=None,
    pct_corr_threshold=0.95,
    features_to_check=None,
    features_to_keep=None,
):
    """Removes columns in feature matrix that are highly correlated with another column.

    Note:
        We make the assumption that, for a pair of features, the feature that is further
        right in the feature matrix produced by ``dfs`` is the more complex one.
        The assumption does not hold if the order of columns in the feature
        matrix has changed from what ``dfs`` produces.

    Args:
        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature
                    names and rows are instances. If Woodwork is not initialized, will
                    perform Woodwork initialization, which may result in slightly different
                    types than those in the original feature matrix created by Featuretools.
        features (list[:class:`featuretools.FeatureBase`] or list[str], optional):
                    List of features to select.
        pct_corr_threshold (float): The correlation threshold to be considered highly
                    correlated. Defaults to 0.95.
        features_to_check (list[str], optional): List of column names to check
                    whether any pairs are highly correlated. Will not check any
                    other columns, meaning the only columns that can be removed
                    are in this list. If null, defaults to checking all columns.
        features_to_keep (list[str], optional): List of column names to keep even
                    if correlated to another column. If null, all columns will be
                    candidates for removal.

    Returns:
        pd.DataFrame, list[:class:`.FeatureBase`]:
            The feature matrix and the list of generated feature definitions.
            Matches dfs output. If no feature list is provided as input,
            the feature list will not be returned. For consistent results,
            do not change the order of features outputted by dfs.
    """
    if feature_matrix.ww.schema is None:
        feature_matrix.ww.init()

    if pct_corr_threshold < 0 or pct_corr_threshold > 1:
        raise ValueError(
            "pct_corr_threshold must be a float between 0 and 1, inclusive.",
        )

    if features_to_check is None:
        features_to_check = list(feature_matrix.columns)
    else:
        for f_name in features_to_check:
            assert (
                f_name in feature_matrix.columns
            ), "feature named {} is not in feature matrix".format(f_name)

    if features_to_keep is None:
        features_to_keep = []

    to_select = ["numeric", Boolean, BooleanNullable]
    fm = feature_matrix.ww[features_to_check]
    fm_to_check = fm.ww.select(include=to_select)

    dropped = set()
    columns_to_check = fm_to_check.columns
    # When two features are found to be highly correlated,
    # we drop the more complex feature
    # Columns produced later in dfs are more complex
    for i in range(len(columns_to_check) - 1, 0, -1):
        more_complex_name = columns_to_check[i]
        more_complex_col = fm_to_check[more_complex_name]

        # Convert boolean or Int64 column to be float64
        if pd.api.types.is_bool_dtype(more_complex_col) or isinstance(
            more_complex_col.dtype,
            pd.Int64Dtype,
        ):
            more_complex_col = more_complex_col.astype("float64")

        for j in range(i - 1, -1, -1):
            less_complex_name = columns_to_check[j]
            less_complex_col = fm_to_check[less_complex_name]

            # Convert boolean or Int64 column to be float64
            if pd.api.types.is_bool_dtype(less_complex_col) or isinstance(
                less_complex_col.dtype,
                pd.Int64Dtype,
            ):
                less_complex_col = less_complex_col.astype("float64")

            if abs(more_complex_col.corr(less_complex_col)) >= pct_corr_threshold:
                dropped.add(more_complex_name)
                break

    keep = [
        f_name
        for f_name in feature_matrix.columns
        if (f_name in features_to_keep or f_name not in dropped)
    ]
    return _apply_feature_selection(keep, feature_matrix, features)


def _apply_feature_selection(keep, feature_matrix, features=None):
    new_matrix = feature_matrix[keep]
    new_feature_names = set(new_matrix.columns)

    if features is not None:
        new_features = []
        for f in features:
            if f.number_output_features > 1:
                slices = [
                    f[i]
                    for i in range(f.number_output_features)
                    if f[i].get_name() in new_feature_names
                ]
                if len(slices) == f.number_output_features:
                    new_features.append(f)
                else:
                    new_features.extend(slices)
            else:
                if f.get_name() in new_feature_names:
                    new_features.append(f)

        return new_matrix, new_features
    return new_matrix
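

# Illustrative sketch only, not part of the featuretools API: the right-to-left
# pairwise scan used by remove_highly_correlated_features, reduced to plain
# Python lists. The helper names (_sketch_pearson and
# _sketch_drop_later_correlated) are hypothetical. The sketch assumes every
# column is numeric, has the same length, has no missing values and is not
# constant, and it does not validate the threshold.
def _sketch_pearson(xs, ys):
    # Pearson correlation of two equal-length numeric sequences.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def _sketch_drop_later_correlated(columns, threshold=0.95):
    # columns: dict mapping column name -> list of values, in dfs output order.
    names = list(columns)
    dropped = set()
    # Walk from the rightmost (assumed most complex) column to the left.
    for i in range(len(names) - 1, 0, -1):
        for j in range(i - 1, -1, -1):
            corr = _sketch_pearson(columns[names[i]], columns[names[j]])
            if abs(corr) >= threshold:
                # Drop the later column and stop comparing it.
                dropped.add(names[i])
                break
    return [name for name in names if name not in dropped]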


================================================
FILE: featuretools/synthesis/__init__.py
================================================
# flake8: noqa
from featuretools.synthesis.api import *


================================================
FILE: featuretools/synthesis/api.py
================================================
# flake8: noqa
from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis
from featuretools.synthesis.dfs import dfs
from featuretools.synthesis.encode_features import encode_features
from featuretools.synthesis.get_valid_primitives import get_valid_primitives


================================================
FILE: featuretools/synthesis/deep_feature_synthesis.py
================================================
import functools
import logging
import operator
import warnings
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List, Tuple, Type

from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable

from featuretools import primitives
from featuretools.entityset.entityset import LTI_COLUMN_NAME
from featuretools.entityset.relationship import RelationshipPath
from featuretools.feature_base import (
    AggregationFeature,
    DirectFeature,
    FeatureBase,
    GroupByTransformFeature,
    IdentityFeature,
    TransformFeature,
)
from featuretools.feature_base.cache import CacheType, feature_cache
from featuretools.feature_base.utils import is_valid_input
from featuretools.primitives.base import (
    AggregationPrimitive,
    PrimitiveBase,
    TransformPrimitive,
)
from featuretools.primitives.options_utils import (
    filter_groupby_matches_by_options,
    filter_matches_by_options,
    generate_all_primitive_options,
    ignore_dataframe_for_primitive,
)
from featuretools.utils.gen_utils import camel_and_title_to_snake

logger = logging.getLogger("featuretools")
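

# Illustrative sketch only, not part of the featuretools API: the name-based
# filtering that the ``drop_contains`` and ``drop_exact`` options of
# DeepFeatureSynthesis apply to generated features, reduced to plain strings.
# The helper name _sketch_filter_feature_names is hypothetical.
def _sketch_filter_feature_names(names, drop_contains=(), drop_exact=()):
    kept = []
    for name in names:
        # Drop a name that contains any of the drop_contains substrings.
        if any(fragment in name for fragment in drop_contains):
            continue
        # Drop a name that exactly matches one of the drop_exact strings.
        if name in drop_exact:
            continue
        kept.append(name)
    return kept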


class DeepFeatureSynthesis(object):
    """Automatically produce features for a target dataframe in an Entityset.

    Args:
        target_dataframe_name (str): Name of dataframe for which to build features.

        entityset (EntitySet): Entityset for which to build features.

        agg_primitives (list[str or :class:`.primitives.AggregationPrimitive`], optional):
            list of Aggregation Feature types to apply.

            Default: ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]

        trans_primitives (list[str or :class:`.primitives.TransformPrimitive`], optional):
            list of Transform primitives to use.

            Default: ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"]

        where_primitives (list[str or :class:`.primitives.PrimitiveBase`], optional):
            only add where clauses to these types of Primitives

            Default:

                ["count"]

        groupby_trans_primitives (list[str or :class:`.primitives.TransformPrimitive`], optional):
            list of Transform primitives to make GroupByTransformFeatures with

        max_depth (int, optional) : maximum allowed depth of features.
            Default: 2. If -1, no limit.

        max_features (int, optional) : Cap the number of generated features to
            this number. If -1, no limit.

        allowed_paths (list[list[str]], optional): Allowed dataframe paths to make
            features for. If None, use all paths.

        ignore_dataframes (list[str], optional): List of dataframes to
            blacklist when creating features. If None, use all dataframes.

        ignore_columns (dict[str -> list[str]], optional): List of specific
            columns within each dataframe to blacklist when creating features.
            If None, use all columns.

        seed_features (list[:class:`.FeatureBase`], optional): List of manually
            defined features to use.

        drop_contains (list[str], optional): Drop features
            whose names contain any of these strings.

        drop_exact (list[str], optional): Drop features whose
            names exactly match any of these strings.

        where_stacking_limit (int, optional): Cap the depth of the where features.
            Default: 1

        primitive_options (dict[str or tuple[str] or PrimitiveBase -> dict or list[dict]], optional):
            Specify options for a single primitive or a group of primitives.
            Lists of option dicts are used to specify options per input for primitives
            with multiple inputs. Each option ``dict`` can have the following keys:


            ``"include_dataframes"``
                List of dataframes to be included when creating features for
                the primitive(s). All other dataframes will be ignored
                (list[str]).
            ``"ignore_dataframes"``
                List of dataframes to be blacklisted when creating features
                for the primitive(s) (list[str]).
            ``"include_columns"``
                List of specific columns within each dataframe to include when
                creating features for the primitive(s). All other columns
                in a given dataframe will be ignored (dict[str -> list[str]]).
            ``"ignore_columns"``
                List of specific columns within each dataframe to blacklist
                when creating features for the primitive(s) (dict[str ->
                list[str]]).
            ``"include_groupby_dataframes"``
                List of dataframes to be included when finding groupbys. All
                other dataframes will be ignored (list[str]).
            ``"ignore_groupby_dataframes"``
                List of dataframes to blacklist when finding groupbys
                (list[str]).
            ``"include_groupby_columns"``
                List of specific columns within each dataframe to include as
                groupbys, if applicable. All other columns in each
                dataframe will be ignored (dict[str -> list[str]]).
            ``"ignore_groupby_columns"``
                List of specific columns within each dataframe to blacklist
                as groupbys (dict[str -> list[str]]).
    """

    def __init__(
        self,
        target_dataframe_name,
        entityset,
        agg_primitives=None,
        trans_primitives=None,
        where_primitives=None,
        groupby_trans_primitives=None,
        max_depth=2,
        max_features=-1,
        allowed_paths=None,
        ignore_dataframes=None,
        ignore_columns=None,
        primitive_options=None,
        seed_features=None,
        drop_contains=None,
        drop_exact=None,
        where_stacking_limit=1,
    ):
        if target_dataframe_name not in entityset.dataframe_dict:
            es_name = entityset.id or "entity set"
            msg = "Provided target dataframe %s does not exist in %s" % (
                target_dataframe_name,
                es_name,
            )
            raise KeyError(msg)

        # Multiple calls to dfs() should start with a fresh cache
        feature_cache.clear_all()
        feature_cache.enabled = True

        # Change max_depth to None because DFS terminates when max_depth < 0
        if max_depth == -1:
            max_depth = None

        # if just one dataframe, set max depth to 1 (transform stacking rule)
        if len(entityset.dataframe_dict) == 1 and (max_depth is None or max_depth > 1):
            warnings.warn(
                "Only one dataframe in entityset, changing max_depth to "
                "1 since deeper features cannot be created",
            )
            max_depth = 1

        self.max_depth = max_depth

        self.max_features = max_features

        self.allowed_paths = allowed_paths
        if self.allowed_paths:
            self.allowed_paths = set()
            for path in allowed_paths:
                self.allowed_paths.add(tuple(path))

        if ignore_dataframes is None:
            self.ignore_dataframes = set()
        else:
            if not isinstance(ignore_dataframes, list):
                raise TypeError("ignore_dataframes must be a list")
            assert (
                target_dataframe_name not in ignore_dataframes
            ), "Can't ignore target_dataframe!"
            self.ignore_dataframes = set(ignore_dataframes)

        self.ignore_columns = _build_ignore_columns(ignore_columns)
        self.target_dataframe_name = target_dataframe_name
        self.es = entityset

        aggregation_primitive_dict = primitives.get_aggregation_primitives()
        transform_primitive_dict = primitives.get_transform_primitives()
        if agg_primitives is None:
            agg_primitives = primitives.get_default_aggregation_primitives()
        self.agg_primitives = sorted(
            [
                check_primitive(
                    p,
                    "aggregation",
                    aggregation_primitive_dict,
                    transform_primitive_dict,
                )
                for p in agg_primitives
            ],
        )

        if trans_primitives is None:
            trans_primitives = primitives.get_default_transform_primitives()

        self.trans_primitives = sorted(
            [
                check_primitive(
                    p,
                    "transform",
                    aggregation_primitive_dict,
                    transform_primitive_dict,
                )
                for p in trans_primitives
            ],
        )

        if where_primitives is None:
            where_primitives = [primitives.Count]
        self.where_primitives = sorted(
            [
                check_primitive(
                    p,
                    "where",
                    aggregation_primitive_dict,
                    transform_primitive_dict,
                )
                for p in where_primitives
            ],
        )

        if groupby_trans_primitives is None:
            groupby_trans_primitives = []
        self.groupby_trans_primitives = sorted(
            [
                check_primitive(
                    p,
                    "groupby transform",
                    aggregation_primitive_dict,
                    transform_primitive_dict,
                )
                for p in groupby_trans_primitives
            ],
        )

        if primitive_options is None:
            primitive_options = {}
        all_primitives = (
            self.trans_primitives
            + self.agg_primitives
            + self.where_primitives
            + self.groupby_trans_primitives
        )

        (
            self.primitive_options,
            self.ignore_dataframes,
            self.ignore_columns,
        ) = generate_all_primitive_options(
            all_primitives,
            primitive_options,
            self.ignore_dataframes,
            self.ignore_columns,
            self.es,
        )
        self.seed_features = sorted(seed_features or [], key=lambda f: f.unique_name())
        self.drop_exact = drop_exact or []
        self.drop_contains = drop_contains or []
        self.where_stacking_limit = where_stacking_limit

    def build_features(self, return_types=None, verbose=False):
        """Automatically builds feature definitions for target
            dataframe using Deep Feature Synthesis algorithm

        Args:
            return_types (list[woodwork.ColumnSchema] or str, optional):
                List of ColumnSchemas defining the types of
                columns to return. If None, defaults to returning all
                numeric, categorical and boolean types. If given as
                the string 'all', use all available return types.

            verbose (bool, optional): If True, print progress.

        Returns:
            list[FeatureBase]: Returns a list of
                features for target dataframe, sorted by feature depth
                (shallow first).
        """
        all_features = {}

        self.where_clauses = defaultdict(set)

        if return_types is None:
            return_types = [
                ColumnSchema(semantic_tags=["numeric"]),
                ColumnSchema(semantic_tags=["category"]),
                ColumnSchema(logical_type=Boolean),
                ColumnSchema(logical_type=BooleanNullable),
            ]
        elif return_types == "all":
            pass
        else:
            msg = "return_types must be a list, or 'all'"
            assert isinstance(return_types, list), msg

        self._run_dfs(
            self.es[self.target_dataframe_name],
            RelationshipPath([]),
            all_features,
            max_depth=self.max_depth,
        )

        new_features = list(all_features[self.target_dataframe_name].values())

        def filt(f):
            # remove identity features of the ID field of the target dataframe
            if (
                isinstance(f, IdentityFeature)
                and f.dataframe_name == self.target_dataframe_name
                and f.column_name == self.es[self.target_dataframe_name].ww.index
            ):
                return False

            return True

        # filter out features with undesired return types
        if return_types != "all":
            new_features = [
                f
                for f in new_features
                if any(
                    is_valid_input(f.column_schema, schema) for schema in return_types
                )
            ]
        new_features = list(filter(filt, new_features))

        new_features.sort(key=lambda f: f.get_depth())

        new_features = self._filter_features(new_features)

        if self.max_features > 0:
            new_features = new_features[: self.max_features]

        if verbose:
            print("Built {} features".format(len(new_features)))
        return new_features

    def _filter_features(self, features):
        assert isinstance(self.drop_exact, list), "drop_exact must be a list"
        assert isinstance(self.drop_contains, list), "drop_contains must be a list"
        f_keep = []
        for f in features:
            keep = True
            for contains in self.drop_contains:
                if contains in f.get_name():
                    keep = False
                    break

            if f.get_name() in self.drop_exact:
                keep = False

            if keep:
                f_keep.append(f)

        return f_keep

    def _run_dfs(self, dataframe, relationship_path, all_features, max_depth):
        """
        Create features for the provided dataframe

        Args:
            dataframe (DataFrame): Dataframe for which to create features.
            relationship_path (RelationshipPath): The path to this dataframe.
            all_features (dict[dataframe name -> dict[str -> FeatureBase]]):
                Dict containing a dict for each dataframe. Each nested dict
                has features as values with their unique names as keys.
            max_depth (int) : Maximum allowed depth of features.
        """
        if max_depth is not None and max_depth < 0:
            return

        all_features[dataframe.ww.name] = {}

        """
        Step 1 - Create identity features
        """
        self._add_identity_features(all_features, dataframe)

        """
        Step 2 - Recursively build features for each dataframe in a backward relationship
        """

        backward_dataframes = self.es.get_backward_dataframes(dataframe.ww.name)
        for b_dataframe_id, sub_relationship_path in backward_dataframes:
            # Skip if we've already created features for this dataframe.
            if b_dataframe_id in all_features:
                continue

            if b_dataframe_id in self.ignore_dataframes:
                continue

            new_path = relationship_path + sub_relationship_path
            if (
                self.allowed_paths
                and tuple(new_path.dataframes()) not in self.allowed_paths
            ):
                continue

            new_max_depth = None
            if max_depth is not None:
                new_max_depth = max_depth - 1
            self._run_dfs(
                dataframe=self.es[b_dataframe_id],
                relationship_path=new_path,
                all_features=all_features,
                max_depth=new_max_depth,
            )

        """
        Step 3 - Create aggregation features for all deep backward relationships
        """

        backward_dataframes = self.es.get_backward_dataframes(
            dataframe.ww.name,
            deep=True,
        )
        for b_dataframe_id, sub_relationship_path in backward_dataframes:
            if b_dataframe_id in self.ignore_dataframes:
                continue

            new_path = relationship_path + sub_relationship_path
            if (
                self.allowed_paths
                and tuple(new_path.dataframes()) not in self.allowed_paths
            ):
                continue

            self._build_agg_features(
                parent_dataframe=self.es[dataframe.ww.name],
                child_dataframe=self.es[b_dataframe_id],
                all_features=all_features,
                max_depth=max_depth,
                relationship_path=sub_relationship_path,
            )

        """
        Step 4 - Create transform features of identity and aggregation features
        """

        self._build_transform_features(all_features, dataframe, max_depth=max_depth)

        """
        Step 5 - Recursively build features for each dataframe in a forward relationship
        """

        forward_dataframes = self.es.get_forward_dataframes(dataframe.ww.name)
        for f_dataframe_id, sub_relationship_path in forward_dataframes:
            # Skip if we've already created features for this dataframe.
            if f_dataframe_id in all_features:
                continue

            if f_dataframe_id in self.ignore_dataframes:
                continue

            new_path = relationship_path + sub_relationship_path
            if (
                self.allowed_paths
                and tuple(new_path.dataframes()) not in self.allowed_paths
            ):
                continue

            new_max_depth = None
            if max_depth is not None:
                new_max_depth = max_depth - 1
            self._run_dfs(
                dataframe=self.es[f_dataframe_id],
                relationship_path=new_path,
                all_features=all_features,
                max_depth=new_max_depth,
            )

        """
        Step 6 - Create direct features for forward relationships
        """

        forward_dataframes = self.es.get_forward_dataframes(dataframe.ww.name)
        for f_dataframe_id, sub_relationship_path in forward_dataframes:
            if f_dataframe_id in self.ignore_dataframes:
                continue

            new_path = relationship_path + sub_relationship_path
            if (
                self.allowed_paths
                and tuple(new_path.dataframes()) not in self.allowed_paths
            ):
                continue

            self._build_forward_features(
                all_features=all_features,
                relationship_path=sub_relationship_path,
                max_depth=max_depth,
            )

        """
        Step 7 - Create transform features of direct features
        """

        self._build_transform_features(
            all_features,
            dataframe,
            max_depth=max_depth,
            require_direct_input=True,
        )

        # now that all features are added, build where clauses
        self._build_where_clauses(all_features, dataframe)

    def _handle_new_feature(self, new_feature, all_features):
        """Adds a new feature to ``all_features``, keyed by its unique name.

        If a feature with the same unique name is already present and is not
        a seed feature, a warning is logged and the feature is not added.

        Args:
            new_feature (:class:`.FeatureBase`): New feature being
                checked.
            all_features (dict[dataframe name -> dict[str -> FeatureBase]]):
                Dict containing a dict for each dataframe. Each nested dict
                has features as values with their unique names as keys.
        """
        dataframe_name = new_feature.dataframe_name
        name = new_feature.unique_name()

        # Warn if this feature is already present, and it is not a seed feature.
        # It is expected that a seed feature could also be generated by dfs.
        if name in all_features[dataframe_name] and name not in (
            f.unique_name() for f in self.seed_features
        ):
            logger.warning(
                "Attempting to add feature %s which is already "
                "present. This is likely a bug." % new_feature,
            )
            return

        all_features[dataframe_name][name] = new_feature

    def _add_identity_features(self, all_features, dataframe):
        """converts all columns from the given dataframe into features

        Args:
            all_features (dict[dataframe name -> dict[str -> FeatureBase]]):
                Dict containing a dict for each dataframe. Each nested dict
                has features as values with their unique names as keys.
            dataframe (DataFrame): DataFrame to calculate features for.
        """
        for col in dataframe.columns:
            if col in self.ignore_columns[dataframe.ww.name] or col == LTI_COLUMN_NAME:
                continue
            new_f = IdentityFeature(self.es[dataframe.ww.name].ww[col])
            self._handle_new_feature(all_features=all_features, new_feature=new_f)

        # add seed features, if any, for dfs to build on top of
        # if there are any multi output features, this will build on
        # top of each output of the feature.
        for f in self.seed_features:
            if f.dataframe_name == dataframe.ww.name:
                self._handle_new_feature(all_features=all_features, new_feature=f)

    def _build_where_clauses(self, all_features, dataframe):
        """Traverses all identity features, and direct features of columns,
            and creates an equality comparison for each interesting value
            set on the underlying column

        Args:
            all_features (dict[dataframe name -> dict[str -> FeatureBase]]):
                Dict containing a dict for each dataframe. Each nested dict
                has features as values with their unique names as keys.
            dataframe (DataFrame): DataFrame to calculate features for.
        """

        def is_valid_feature(f):
            if isinstance(f, IdentityFeature):
                return True
            if isinstance(f, DirectFeature) and getattr(
                f.base_features[0],
                "column_name",
                None,
            ):
                return True
            return False

        for feat in [
            f for f in all_features[dataframe.ww.name].values() if is_valid_feature(f)
        ]:
            # Get interesting_values from the EntitySet that was passed, which
            # is assumed to be the most recent version of the EntitySet.
            # Features can contain a stale EntitySet reference without
            # interesting_values
            if isinstance(feat, DirectFeature):
                df = feat.base_features[0].dataframe_name
                col = feat.base_features[0].column_name
            else:
                df = feat.dataframe_name
                col = feat.column_name
            metadata = self.es[df].ww.columns[col].metadata
            interesting_values = metadata.get("interesting_values")
            if interesting_values:
                for val in interesting_values:
                    self.where_clauses[dataframe.ww.name].add(feat == val)

    def _build_transform_features(
        self,
        all_features,
        dataframe,
        max_depth=0,
        require_direct_input=False,
    ):
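        """Creates transform features and groupby transform features for
        the features of a dataframe

        Args:
            all_features (dict[dataframe name -> dict[str -> FeatureBase]]):
                Dict containing a dict for each dataframe. Each nested dict
                has features as values with their unique names as keys.

            dataframe (DataFrame): DataFrame to calculate features for.

            max_depth (int, optional): Maximum allowed depth of features.

            require_direct_input (bool, optional): If True, only create
                features that use a DirectFeature as an input or, for
                groupby transform features, as the groupby.
        """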

        new_max_depth = None
        if max_depth is not None:
            new_max_depth = max_depth - 1

        # Keep track of features to add until the end to avoid applying
        # transform primitives to features that were also built by transform primitives
        features_to_add = []

        for trans_prim in self.trans_primitives:
            current_options = self.primitive_options.get(
                trans_prim,
                self.primitive_options.get(trans_prim.name),
            )
            if ignore_dataframe_for_primitive(current_options, dataframe):
                continue

            input_types = trans_prim.input_types

            matching_inputs = self._get_matching_inputs(
                all_features,
                dataframe,
                new_max_depth,
                input_types,
                trans_prim,
                current_options,
                require_direct_input=require_direct_input,
                feature_filter=not_a_transform_input,
            )

            for matching_input in matching_inputs:
                if not can_stack_primitive_on_inputs(trans_prim, matching_input):
                    continue
                if not any(
                    True for bf in matching_input if bf.number_output_features != 1
                ):
                    new_f = TransformFeature(matching_input, primitive=trans_prim)
                    features_to_add.append(new_f)

        for groupby_prim in self.groupby_trans_primitives:
            current_options = self.primitive_options.get(
                groupby_prim,
                self.primitive_options.get(groupby_prim.name),
            )
            if ignore_dataframe_for_primitive(current_options, dataframe, groupby=True):
                continue
            input_types = groupby_prim.input_types[:]
            matching_inputs = self._get_matching_inputs(
                all_features,
                dataframe,
                new_max_depth,
                input_types,
                groupby_prim,
                current_options,
                feature_filter=not_a_transform_input,
            )

            # get columns to use as groupbys, use IDs as default unless other groupbys specified
            if any(
                True
                for option in current_options
                if dataframe.ww.name in option.get("include_groupby_columns", [])
            ):
                column_schemas = "all"
            else:
                column_schemas = [ColumnSchema(semantic_tags=["foreign_key"])]
            groupby_matches = self._features_by_type(
                all_features=all_features,
                dataframe=dataframe,
                max_depth=new_max_depth,
                column_schemas=column_schemas,
            )
            groupby_matches = filter_groupby_matches_by_options(
                groupby_matches,
                current_options,
            )

            for matching_input in matching_inputs:
                if not can_stack_primitive_on_inputs(groupby_prim, matching_input):
                    continue
                if any(True for bf in matching_input if bf.number_output_features != 1):
                    continue
                if require_direct_input:
                    if any_direct_in_matching_input := any(
                        isinstance(bf, DirectFeature) for bf in matching_input
                    ):
                        all_direct_and_same_path_in_matching_input = (
                            _all_direct_and_same_path(matching_input)
                        )
                for groupby in groupby_matches:
                    if require_direct_input:
                        # If require_direct_input, require a DirectFeature in input or as a
                        # groupby, and don't create features of inputs/groupbys which are
                        # all direct features with the same relationship path
                        #
                        # If we require_direct_input, we skip Feature generation
                        # in the following two cases:
                        # (1) --> There are no DirectFeatures in the matching input,
                        #         and groupby is not a DirectFeature
                        # (2) --> All of the matching input and groupby are DirectFeatures
                        #         with the same relationship path
                        groupby_is_direct = isinstance(groupby[0], DirectFeature)
                        # Checks case (1)
                        if not any_direct_in_matching_input:
                            if not groupby_is_direct:
                                continue
                        elif all_direct_and_same_path_in_matching_input:
                            # Checks case (2)
                            if (
                                groupby_is_direct
                                and groupby[0].relationship_path
                                == matching_input[0].relationship_path
                            ):
                                continue
                    new_f = GroupByTransformFeature(
                        list(matching_input),
                        groupby=groupby[0],
                        primitive=groupby_prim,
                    )
                    features_to_add.append(new_f)
        for new_f in features_to_add:
            self._handle_new_feature(all_features=all_features, new_feature=new_f)

    def _build_forward_features(self, all_features, relationship_path, max_depth=0):
        _, relationship = relationship_path[0]

        child_dataframe_name = relationship.child_dataframe.ww.name
        parent_dataframe = relationship.parent_dataframe

        features = self._features_by_type(
            all_features=all_features,
            dataframe=parent_dataframe,
            max_depth=max_depth,
            column_schemas="all",
        )

        for f in features:
            if self._feature_in_relationship_path(relationship_path, f):
                continue

            # limits allowing direct features of agg_feats with where clauses
            if isinstance(f, AggregationFeature):
                deep_base_features = [f] + f.get_dependencies(deep=True)
                for feat in deep_base_features:
                    if isinstance(feat, AggregationFeature) and feat.where is not None:
                        continue

            new_f = DirectFeature(f, child_dataframe_name, relationship=relationship)

            self._handle_new_feature(all_features=all_features, new_feature=new_f)

    def _build_agg_features(
        self,
        all_features,
        parent_dataframe,
        child_dataframe,
        max_depth,
        relationship_path,
    ):
        new_max_depth = None
        if max_depth is not None:
            new_max_depth = max_depth - 1
        for agg_prim in self.agg_primitives:
            current_options = self.primitive_options.get(
                agg_prim,
                self.primitive_options.get(agg_prim.name),
            )

            if ignore_dataframe_for_primitive(current_options, child_dataframe):
                continue

            def feature_filter(f):
                # Remove direct features of parent dataframe and features in relationship path.
                return (
                    not _direct_of_dataframe(f, parent_dataframe)
                ) and not self._feature_in_relationship_path(relationship_path, f)

            input_types = agg_prim.input_types
            matching_inputs = self._get_matching_inputs(
                all_features,
                child_dataframe,
                new_max_depth,
                input_types,
                agg_prim,
                current_options,
                feature_filter=feature_filter,
            )

            matching_inputs = filter_matches_by_options(
                matching_inputs,
                current_options,
            )
            wheres = list(self.where_clauses[child_dataframe.ww.name])

            for matching_input in matching_inputs:
                if not can_stack_primitive_on_inputs(agg_prim, matching_input):
                    continue
                new_f = AggregationFeature(
                    matching_input,
                    parent_dataframe_name=parent_dataframe.ww.name,
                    relationship_path=relationship_path,
                    primitive=agg_prim,
                )

                self._handle_new_feature(new_f, all_features)

                # limit the stacking of where features
                # count up the number of where features
                # in this feature and its dependencies
                feat_wheres = []
                for f in matching_input:
                    if isinstance(f, AggregationFeature) and f.where is not None:
                        feat_wheres.append(f)
                    for feat in f.get_dependencies(deep=True):
                        if (
                            isinstance(feat, AggregationFeature)
                            and feat.where is not None
                        ):
                            feat_wheres.append(feat)

                if len(feat_wheres) >= self.where_stacking_limit:
                    continue

                # limits the aggregation feature by the given allowed feature types.
                if not any(
                    True
                    for primitive in self.where_primitives
                    if issubclass(type(agg_prim), type(primitive))
                ):
                    continue

                for where in wheres:
                    # limits the where feats so they are different than base feats
                    base_names = [f.unique_name() for f in new_f.base_features]
                    if any(
                        True
                        for base_feat in where.base_features
                        if base_feat.unique_name() in base_names
                    ):
                        continue

                    new_f = AggregationFeature(
                        matching_input,
                        parent_dataframe_name=parent_dataframe.ww.name,
                        relationship_path=relationship_path,
                        where=where,
                        primitive=agg_prim,
                    )
                    self._handle_new_feature(new_f, all_features)

    def _features_by_type(
        self,
        all_features,
        dataframe,
        max_depth,
        column_schemas=None,
    ):
        if max_depth is not None and max_depth < 0:
            return []

        if dataframe.ww.name not in all_features:
            return []

        def expand_features(feature) -> List[Any]:
            """Internal method to return either the single feature
                or the output features

            Args:
                feature (Feature): Feature instance

            Returns:
                List[Any]: list of features
            """
            outputs = feature.number_output_features
            if outputs > 1:
                return [feature[i] for i in range(outputs)]
            return [feature]

        # Build the complete list of features prior to processing
        selected_features = [
            expand_features(feature)
            for feature in all_features[dataframe.ww.name].values()
        ]
        selected_features = functools.reduce(operator.iconcat, selected_features, [])

        column_schemas = column_schemas if column_schemas else set()

        if max_depth is None and column_schemas == "all":
            return selected_features

        # assigning seed_features locally adds a slight performance benefit by not having to look
        # up the property for each round of the comprehension
        seed_features = self.seed_features
        if max_depth is not None:
            selected_features = [
                feature
                for feature in selected_features
                if get_feature_depth(feature, stop_at=seed_features) <= max_depth
            ]

        def valid_input(column_schema) -> bool:
            """Helper method to validate the feature schema
               to the allowed column_schemas

            Args:
                column_schema (ColumnSchema): feature column schema

            Returns:
                bool: True if valid
            """
            return any(
                True
                for schema in column_schemas
                if is_valid_input(column_schema, schema)
            )

        if column_schemas and column_schemas != "all":
            selected_features = [
                feature
                for feature in selected_features
                if valid_input(feature.column_schema)
            ]

        return selected_features

    def _feature_in_relationship_path(self, relationship_path, feature):
        # must be identity feature to be in the relationship path
        if not isinstance(feature, IdentityFeature):
            return False

        for _, relationship in relationship_path:
            if (
                relationship.child_name == feature.dataframe_name
                and relationship._child_column_name == feature.column_name
            ):
                return True

            if (
                relationship.parent_name == feature.dataframe_name
                and relationship._parent_column_name == feature.column_name
            ):
                return True

        return False

    def _get_matching_inputs(
        self,
        all_features,
        dataframe,
        max_depth,
        input_types,
        primitive,
        primitive_options,
        require_direct_input=False,
        feature_filter=None,
    ):
        if not isinstance(input_types[0], list):
            input_types = [input_types]
        matching_inputs = []

        for input_type in input_types:
            features = self._features_by_type(
                all_features=all_features,
                dataframe=dataframe,
                max_depth=max_depth,
                column_schemas=list(input_type),
            )
            if not features:
                continue

            if feature_filter:
                features = [f for f in features if feature_filter(f)]

            matches = match(
                input_type,
                features,
                commutative=primitive.commutative,
                require_direct_input=require_direct_input,
            )

            matching_inputs.extend(matches)

        # everything following depends on populated matching_inputs
        if not matching_inputs:
            return matching_inputs

        if require_direct_input:
            # Don't create trans features of inputs which are all direct
            # features with the same relationship_path.
            matching_inputs = {
                inputs
                for inputs in matching_inputs
                if not _all_direct_and_same_path(inputs)
            }
        matching_inputs = filter_matches_by_options(
            matching_inputs,
            primitive_options,
            commutative=primitive.commutative,
        )

        # Don't build features on numeric foreign key columns
        matching_inputs = [
            match
            for match in matching_inputs
            if not _match_contains_numeric_foreign_key(match)
        ]

        return matching_inputs


def _match_contains_numeric_foreign_key(match):
    match_schema = ColumnSchema(semantic_tags={"foreign_key", "numeric"})
    return any(True for f in match if is_valid_input(f.column_schema, match_schema))


def not_a_transform_input(feature):
    """
    Verifies transform inputs are not transform features or direct features of transform features.
    Returns True if a transform primitive can stack on the feature, and False if it cannot.
    """
    primitive = _find_root_primitive(feature)
    return not isinstance(primitive, TransformPrimitive)


def _find_root_primitive(feature):
    """
    If a feature is a DirectFeature, finds the primitive of
    the "original" base feature.
    """
    if isinstance(feature, DirectFeature):
        return _find_root_primitive(feature.base_features[0])
    return feature.primitive


def _check_if_stacking_is_prohibited(
    feature: FeatureBase,
    f_primitive: PrimitiveBase,
    primitive: PrimitiveBase,
    primitive_class: Type[PrimitiveBase],
    primitive_stack_on_self: bool,
    tuple_primitive_stack_on_exclude: Tuple[Type[PrimitiveBase]],
):
    if not primitive_stack_on_self and isinstance(f_primitive, primitive_class):
        return True

    if isinstance(f_primitive, tuple_primitive_stack_on_exclude):
        return True

    if feature.number_output_features > 1:
        return True

    if f_primitive.base_of_exclude is not None and isinstance(
        primitive,
        tuple(f_primitive.base_of_exclude),
    ):
        return True
    return False


def _check_if_stacking_is_permitted(
    f_primitive: PrimitiveBase,
    primitive_class: Type[PrimitiveBase],
    primitive_stack_on_self: bool,
    tuple_primitive_stack_on: Tuple[Type[PrimitiveBase]],
):
    if primitive_stack_on_self and isinstance(f_primitive, primitive_class):
        return True
    if tuple_primitive_stack_on is None or isinstance(
        f_primitive,
        tuple_primitive_stack_on,
    ):
        return True
    if f_primitive.base_of is None:
        return True
    if primitive_class in f_primitive.base_of:
        return True
    return False


def can_stack_primitive_on_inputs(primitive: PrimitiveBase, inputs: List[FeatureBase]):
    """
    Checks if features in inputs can be used with supplied primitive
    using the stacking rules.
    Returns True if stacking is possible, and False if not.
    """

    primitive_class = primitive.__class__
    tuple_primitive_stack_on = (
        tuple(primitive.stack_on) if primitive.stack_on is not None else None
    )
    tuple_primitive_stack_on_exclude = (
        tuple(primitive.stack_on_exclude)
        if primitive.stack_on_exclude is not None
        else tuple()
    )
    primitive_stack_on_self: bool = primitive.stack_on_self

    for feature in inputs:
        # In the case that the feature is a DirectFeature, the feature's primitive will be a PrimitiveBase object.
        # However, we want to check stacking rules with the primitive the DirectFeature is based on.
        f_primitive = _find_root_primitive(feature)

        # check if stacking is prohibited
        if _check_if_stacking_is_prohibited(
            feature,
            f_primitive,
            primitive,
            primitive_class,
            primitive_stack_on_self,
            tuple_primitive_stack_on_exclude,
        ):
            return False

        # we permit stacking only if it is not prohibited and meets the criterion to be permitted
        if not _check_if_stacking_is_permitted(
            f_primitive,
            primitive_class,
            primitive_stack_on_self,
            tuple_primitive_stack_on,
        ):
            return False

    # if we reach this line nothing is prohibited and stacking is permitted for all inputs
    return True


def match_by_schema(features, column_schema):
    return [f for f in features if is_valid_input(f.column_schema, column_schema)]


def match(
    input_types,
    features,
    replace=False,
    commutative=False,
    require_direct_input=False,
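):
    """Finds combinations of features whose column schemas satisfy ``input_types``.

    Each match holds one feature per entry in ``input_types``. Unless
    ``replace`` is True, a feature is used at most once per match. If
    ``commutative`` is True, different orderings of the same features collapse
    into a single match sorted by lowercased feature name. If
    ``require_direct_input`` is True, every match contains at least one
    DirectFeature.
    """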
    to_match = input_types[0]

    matches = match_by_schema(features, to_match)

    if len(input_types) == 1:
        return [
            (m,)
            for m in matches
            if (not require_direct_input or isinstance(m, DirectFeature))
        ]

    matching_inputs = set()

    for m in matches:
        copy = features[:]

        if not replace:
            copy = [c for c in copy if c.unique_name() != m.unique_name()]

        # If we need a DirectFeature and this is not a DirectFeature then one of the rest must be.
        still_require_direct_input = require_direct_input and not isinstance(
            m,
            DirectFeature,
        )
        rest = match(
            input_types[1:],
            copy,
            replace,
            require_direct_input=still_require_direct_input,
        )

        for r in rest:
            new_match = [m] + list(r)

            # commutative uses frozenset instead of tuple because it doesn't
            # want multiple orderings of the same input
            if commutative:
                new_match = frozenset(new_match)
            else:
                new_match = tuple(new_match)
            matching_inputs.add(new_match)

    if commutative:
        matching_inputs = {
            tuple(sorted(s, key=lambda x: x.get_name().lower()))
            for s in matching_inputs
        }

    return matching_inputs


def handle_primitive(primitive):
    if not isinstance(primitive, PrimitiveBase):
        primitive = primitive()
    assert isinstance(primitive, PrimitiveBase), "must be a primitive"
    return primitive


def check_primitive(
    primitive,
    prim_type,
    aggregation_primitive_dict,
    transform_primitive_dict,
):
    if prim_type in ("transform", "groupby transform"):
        prim_dict = transform_primitive_dict
        supertype = TransformPrimitive
        arg_name = (
            "trans_primitives"
            if prim_type == "transform"
            else "groupby_trans_primitives"
        )
        s = "a transform"
    if prim_type in ("aggregation", "where"):
        prim_dict = aggregation_primitive_dict
        supertype = AggregationPrimitive
        arg_name = (
            "agg_primitives" if prim_type == "aggregation" else "where_primitives"
        )
        s = "an aggregation"

    if isinstance(primitive, str):
        prim_string = camel_and_title_to_snake(primitive)
        if prim_string not in prim_dict:
            raise ValueError(
                "Unknown {} primitive {}. "
                "Call ft.primitives.list_primitives() to get"
                " a list of available primitives".format(prim_type, prim_string),
            )
        primitive = prim_dict[prim_string]
    primitive = handle_primitive(primitive)
    if not isinstance(primitive, supertype):
        raise ValueError(
            "Primitive {} in {} is not {} primitive".format(
                type(primitive),
                arg_name,
                s,
            ),
        )
    return primitive


def _all_direct_and_same_path(input_features: List[FeatureBase]) -> bool:
    """Given a list of features, returns True if they are all
    DirectFeatures with the same relationship_path, and False if not
    """
    path = input_features[0].relationship_path
    for f in input_features:
        if not isinstance(f, DirectFeature) or f.relationship_path != path:
            return False
    return True


def _build_ignore_columns(input_dict: Dict[str, List[str]]) -> DefaultDict[str, set]:
    """Iterates over the input dictionary to build the ignore_columns defaultdict.
    Expects the input_dict's keys to be strings, and values to be lists of strings.
    Raises a TypeError if they are not.
    """
    ignore_columns = defaultdict(set)
    if input_dict is not None:
        for df_name, cols in input_dict.items():
            if not isinstance(df_name, str) or not isinstance(cols, list):
                raise TypeError("ignore_columns should be dict[str -> list]")
            elif not all(isinstance(c, str) for c in cols):
                raise TypeError("list in ignore_columns must only have string values")
            ignore_columns[df_name] = set(cols)
    return ignore_columns


def _direct_of_dataframe(feature, parent_dataframe):
    return (
        isinstance(feature, DirectFeature)
        and feature.parent_dataframe_name == parent_dataframe.ww.name
    )


def get_feature_depth(feature, stop_at=None):
    """Helper method to allow caching of feature.get_depth()
    Why here and not in FeatureBase?  This keeps the caching
    local to DFS.
    """
    hash_key = hash(f"{feature.get_name()}{feature.dataframe_name}{stop_at}")
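    # Compare against None (assumed to signal a cache miss) so that a cached
    # depth of 0 still counts as a hit
    cached_depth = feature_cache.get(CacheType.DEPTH, hash_key)
    if cached_depth is not None:
        return cached_depth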
    depth = feature.get_depth(stop_at=stop_at)
    feature_cache.add(CacheType.DEPTH, hash_key, depth)
    return depth


================================================
FILE: featuretools/synthesis/dfs.py
================================================
import warnings

from featuretools.computational_backends import calculate_feature_matrix
from featuretools.entityset import EntitySet
from featuretools.exceptions import UnusedPrimitiveWarning
from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis
from featuretools.synthesis.utils import _categorize_features, get_unused_primitives
from featuretools.utils import entry_point


@entry_point("featuretools_dfs")
def dfs(
    dataframes=None,
    relationships=None,
    entityset=None,
    target_dataframe_name=None,
    cutoff_time=None,
    instance_ids=None,
    agg_primitives=None,
    trans_primitives=None,
    groupby_trans_primitives=None,
    allowed_paths=None,
    max_depth=2,
    ignore_dataframes=None,
    ignore_columns=None,
    primitive_options=None,
    seed_features=None,
    drop_contains=None,
    drop_exact=None,
    where_primitives=None,
    max_features=-1,
    cutoff_time_in_index=False,
    save_progress=None,
    features_only=False,
    training_window=None,
    approximate=None,
    chunk_size=None,
    n_jobs=1,
    dask_kwargs=None,
    verbose=False,
    return_types=None,
    progress_callback=None,
    include_cutoff_time=True,
):
    """Calculates a feature matrix and features given a dictionary of dataframes
    and a list of relationships.


    Args:
        dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]):
            Dictionary of DataFrames. Entries take the format
            {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}.
            Note that only the dataframe is required. If a Woodwork DataFrame is supplied, any other parameters
            will be ignored.

        relationships (list[(str, str, str, str)]): List of relationships
            between dataframes. List items are a tuple with the format
            (parent dataframe name, parent column, child dataframe name, child column).

        entityset (EntitySet): An already initialized entityset. Required if
            dataframes and relationships are not defined.

        target_dataframe_name (str): Name of dataframe on which to make predictions.

        cutoff_time (pd.DataFrame or Datetime or str): Specifies times at which to calculate
            the features for each instance. The resulting feature matrix will use data
            up to and including the cutoff_time. Can either be a DataFrame, a single
            value, or a string that can be parsed into a datetime. If a DataFrame is passed
            the instance ids for which to calculate features must be in a column with the
            same name as the target dataframe index or a column named `instance_id`.
            The cutoff time values in the DataFrame must be in a column with the same name as
            the target dataframe time index or a column named `time`. If the DataFrame has more
            than two columns, any additional columns will be added to the resulting feature
            matrix. If a single value is passed, this value will be used for all instances.

        instance_ids (list): List of instances on which to calculate features. Only
            used if cutoff_time is a single datetime.

        agg_primitives (list[str or AggregationPrimitive], optional): List of Aggregation
            Feature types to apply.

                Default: ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]

        trans_primitives (list[str or TransformPrimitive], optional):
            List of Transform Feature functions to apply.

                Default: ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"]

        groupby_trans_primitives (list[str or TransformPrimitive], optional):
            List of Transform primitives used to create GroupByTransformFeatures.

        allowed_paths (list[list[str]]): Allowed dataframe paths on which to make
            features.

        max_depth (int) : Maximum allowed depth of features.

        ignore_dataframes (list[str], optional): List of dataframes to
            blacklist when creating features.

        ignore_columns (dict[str -> list[str]], optional): Dictionary mapping
            dataframe names to lists of columns to blacklist when creating features.

        primitive_options (list[dict[str or tuple[str] -> dict]] or dict[str or tuple[str] -> dict], optional):
            Specify options for a single primitive or a group of primitives.
            Lists of option dicts are used to specify options per input for primitives
            with multiple inputs. Each option ``dict`` can have the following keys:

            ``"include_dataframes"``
                List of dataframes to be included when creating features for
                the primitive(s). All other dataframes will be ignored
                (list[str]).
            ``"ignore_dataframes"``
                List of dataframes to be blacklisted when creating features
                for the primitive(s) (list[str]).
            ``"include_columns"``
                List of specific columns within each dataframe to include when
                creating features for the primitive(s). All other columns
                in a given dataframe will be ignored (dict[str -> list[str]]).
            ``"ignore_columns"``
                List of specific columns within each dataframe to blacklist
                when creating features for the primitive(s) (dict[str ->
                list[str]]).
            ``"include_groupby_dataframes"``
                List of dataframes to be included when finding groupbys. All
                other dataframes will be ignored (list[str]).
            ``"ignore_groupby_dataframes"``
                List of dataframes to blacklist when finding groupbys
                (list[str]).
            ``"include_groupby_columns"``
                List of specific columns within each dataframe to include as
                groupbys, if applicable. All other columns in each
                dataframe will be ignored (dict[str -> list[str]]).
            ``"ignore_groupby_columns"``
                List of specific columns within each dataframe to blacklist
                as groupbys (dict[str -> list[str]]).

        seed_features (list[:class:`.FeatureBase`]): List of manually defined
            features to use.

        drop_contains (list[str], optional): Drop features
            whose names contain any of these strings.

        drop_exact (list[str], optional): Drop features that
            exactly match these strings in name.

        where_primitives (list[str or PrimitiveBase], optional):
            List of Primitives names (or types) to apply with where clauses.

                Default:

                    ["count"]

        max_features (int, optional) : Cap the number of generated features to
                this number. If -1, no limit.

        features_only (bool, optional): If True, returns the list of
            features without calculating the feature matrix.

        cutoff_time_in_index (bool): If True, return a DataFrame with a MultiIndex
            where the second index is the cutoff time (first is instance id).
            DataFrame will be sorted by (time, instance_id).

        training_window (Timedelta or str, optional):
            Window defining how much time before the cutoff time data
            can be used when calculating features. If ``None`` , all data
            before cutoff time is used. Defaults to ``None``. Month and year
            units are not relative when Pandas Timedeltas are used. Relative
            units should be passed as a Featuretools Timedelta or a string.

        approximate (Timedelta): Bucket size used to group instances with similar
            cutoff times when calculating expensive features. For example,
            if the bucket size is 24 hours, all instances with cutoff times on
            the same day will share one calculation for expensive features.

        save_progress (str, optional): Path to save intermediate computational results.

        n_jobs (int, optional): number of parallel processes to use when
            calculating feature matrix

        chunk_size (int or float or None or "cutoff time", optional): Number
            of rows of output feature matrix to calculate at a time. If passed an
            integer greater than 0, will try to use that many rows per chunk.
            If passed a float value between 0 and 1, sets the chunk size to that
            percentage of all instances. If passed the string "cutoff time",
            rows are split per cutoff time.

        dask_kwargs (dict, optional): Dictionary of keyword arguments to be
            passed when creating the dask client and scheduler. Even if n_jobs
            is not set, using `dask_kwargs` will enable multiprocessing.
            Main parameters:

            cluster (str or dask.distributed.LocalCluster):
                cluster or address of cluster to send tasks to. If unspecified,
                a cluster will be created.
            diagnostics_port (int):
                port number to use for web dashboard.  If left unspecified, web
                interface will not be enabled.

            Valid keyword arguments for LocalCluster will also be accepted.

        return_types (list[woodwork.ColumnSchema] or str, optional):
            List of ColumnSchemas defining the types of
            columns to return. If None, defaults to returning all
            numeric, categorical and boolean types. If given as
            the string 'all', returns all available types.

        progress_callback (callable): function to be called with incremental progress updates.
            Has the following parameters:

                update: percentage change (float between 0 and 100) in progress since last call
                progress_percent: percentage (float between 0 and 100) of total computation completed
                time_elapsed: total time in seconds that has elapsed since start of call

        include_cutoff_time (bool): Include data at cutoff times in feature calculations. Defaults to ``True``.

    Returns:
        pd.DataFrame, list[:class:`.FeatureBase`]:
            The feature matrix and the list of generated feature definitions.
            If ``features_only`` is ``True``, the feature matrix is not computed
            and only the list of feature definitions is returned.

    Examples:
        .. code-block:: python

            from featuretools import dfs

            # session_df, transactions_df, and cutoff_times (one cutoff time
            # per instance) are assumed to be defined already.
            dataframes = {
                "sessions" : (session_df, "id"),
                "transactions" : (transactions_df, "id", "transaction_time")
            }
            relationships = [("sessions", "id", "transactions", "session_id")]
            feature_matrix, features = dfs(dataframes=dataframes,
                                           relationships=relationships,
                                           target_dataframe_name="transactions",
                                           cutoff_time=cutoff_times)
            feature_matrix

            features = dfs(dataframes=dataframes,
                           relationships=relationships,
                           target_dataframe_name="transactions",
                           features_only=True)
    """
    if not isinstance(entityset, EntitySet):
        entityset = EntitySet("dfs", dataframes, relationships)

    dfs_object = DeepFeatureSynthesis(
        target_dataframe_name,
        entityset,
        agg_primitives=agg_primitives,
        trans_primitives=trans_primitives,
        groupby_trans_primitives=groupby_trans_primitives,
        max_depth=max_depth,
        where_primitives=where_primitives,
        allowed_paths=allowed_paths,
        drop_exact=drop_exact,
        drop_contains=drop_contains,
        ignore_dataframes=ignore_dataframes,
        ignore_columns=ignore_columns,
        primitive_options=primitive_options,
        max_features=max_features,
        seed_features=seed_features,
    )

    features = dfs_object.build_features(verbose=verbose, return_types=return_types)

    trans, agg, groupby, where = _categorize_features(features)

    trans_unused = get_unused_primitives(trans_primitives, trans)
    agg_unused = get_unused_primitives(agg_primitives, agg)
    groupby_unused = get_unused_primitives(groupby_trans_primitives, groupby)
    where_unused = get_unused_primitives(where_primitives, where)

    unused_primitives = [trans_unused, agg_unused, groupby_unused, where_unused]
    if any(unused_primitives):
        warn_unused_primitives(unused_primitives)

    if features_only:
        return features

    assert (
        features != []
    ), "No features can be generated from the specified primitives. Please make sure the primitives you are using are compatible with the variable types in your data."

    feature_matrix = calculate_feature_matrix(
        features,
        entityset=entityset,
        cutoff_time=cutoff_time,
        instance_ids=instance_ids,
        training_window=training_window,
        approximate=approximate,
        cutoff_time_in_index=cutoff_time_in_index,
        save_progress=save_progress,
        chunk_size=chunk_size,
        n_jobs=n_jobs,
        dask_kwargs=dask_kwargs,
        verbose=verbose,
        progress_callback=progress_callback,
        include_cutoff_time=include_cutoff_time,
    )
    return feature_matrix, features


def warn_unused_primitives(unused_primitives):
    messages = [
        "  trans_primitives: {}\n",
        "  agg_primitives: {}\n",
        "  groupby_trans_primitives: {}\n",
        "  where_primitives: {}\n",
    ]
    unused_string = ""
    for primitives, message in zip(unused_primitives, messages):
        if primitives:
            unused_string += message.format(primitives)

    warning_msg = (
        "Some specified primitives were not used during DFS:\n{}".format(unused_string)
        + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, "
        + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call "
        + "contained multiple instances of a primitive in the list above, none of them were used."
    )

    warnings.warn(warning_msg, UnusedPrimitiveWarning)


================================================
FILE: featuretools/synthesis/encode_features.py
================================================
import logging

import pandas as pd

from featuretools.computational_backends.utils import get_ww_types_from_features
from featuretools.utils.gen_utils import make_tqdm_iterator

logger = logging.getLogger("featuretools")

DEFAULT_TOP_N = 10


def encode_features(
    feature_matrix,
    features,
    top_n=DEFAULT_TOP_N,
    include_unknown=True,
    to_encode=None,
    inplace=False,
    drop_first=False,
    verbose=False,
):
    """Encode categorical features

    Args:
        feature_matrix (pd.DataFrame): Dataframe of features.
        features (list[FeatureBase]): Feature definitions in feature_matrix.
        top_n (int or dict[string -> int]): Number of top values to include.
            If dict[string -> int] is used, key is feature name and value is
            the number of top values to include for that feature.
            If a feature's name is not in dictionary, a default value of 10 is used.
        include_unknown (bool): Add feature encoding an unknown class.
            Defaults to True.
        to_encode (list[str]): List of feature names to encode.
            Features not in this list are left unencoded in the output matrix.
            Defaults to encoding all necessary features.
        inplace (bool): Encode feature_matrix in place. Defaults to False.
        drop_first (bool): Whether to get k-1 dummies out of k categorical
            levels by removing the first level.
            Defaults to False.
        verbose (bool): Print progress info. Defaults to False.

    Returns:
        (pd.DataFrame, list) : encoded feature_matrix, encoded features

    Example:
        .. ipython:: python
            :suppress:

            from featuretools.tests.testing_utils import make_ecommerce_entityset
            import featuretools as ft
            es = make_ecommerce_entityset()

        .. ipython:: python

            f1 = ft.Feature(es["log"].ww["product_id"])
            f2 = ft.Feature(es["log"].ww["purchased"])
            f3 = ft.Feature(es["log"].ww["value"])

            features = [f1, f2, f3]
            ids = [0, 1, 2, 3, 4, 5]
            feature_matrix = ft.calculate_feature_matrix(features, es,
                                                         instance_ids=ids)

            fm_encoded, f_encoded = ft.encode_features(feature_matrix,
                                                       features)
            f_encoded

            fm_encoded, f_encoded = ft.encode_features(feature_matrix,
                                                       features, top_n=2)
            f_encoded

            fm_encoded, f_encoded = ft.encode_features(feature_matrix, features,
                                                       include_unknown=False)
            f_encoded

            fm_encoded, f_encoded = ft.encode_features(feature_matrix, features,
                                                       to_encode=['purchased'])
            f_encoded

            fm_encoded, f_encoded = ft.encode_features(feature_matrix, features,
                                                       drop_first=True)
            f_encoded
    """
    if inplace:
        X = feature_matrix
    else:
        X = feature_matrix.copy()

    old_feature_names = set()
    for feature in features:
        for fname in feature.get_feature_names():
            assert fname in X.columns, "Feature %s not found in feature matrix" % (
                fname
            )
            old_feature_names.add(fname)

    pass_through = [col for col in X.columns if col not in old_feature_names]

    if verbose:
        iterator = make_tqdm_iterator(
            iterable=features,
            total=len(features),
            desc="Encoding pass 1",
            unit="feature",
        )
    else:
        iterator = features

    new_feature_list = []
    kept_columns = []
    encoded_columns = []
    columns_info = feature_matrix.ww.columns

    for f in iterator:
        # TODO: features with multiple columns are not encoded by this method,
        # which can cause an "encoded" matrix with non-numeric values
        is_discrete = {"category", "foreign_key"}.intersection(
            f.column_schema.semantic_tags,
        )
        if f.number_output_features > 1 or not is_discrete:
            if f.number_output_features > 1:
                logger.warning(
                    "Feature %s has multiple columns and will not "
                    "be encoded.  This may result in a matrix with"
                    " non-numeric values." % (f),
                )
            new_feature_list.append(f)
            kept_columns.extend(f.get_feature_names())
            continue

        if to_encode is not None and f.get_name() not in to_encode:
            new_feature_list.append(f)
            kept_columns.extend(f.get_feature_names())
            continue

        val_counts = X[f.get_name()].value_counts()
        # Remove 0 count category values
        val_counts = val_counts[val_counts > 0].to_frame()
        index_name = val_counts.index.name
        val_counts = val_counts.rename(columns={val_counts.columns[0]: "count"})
        if index_name is None:
            if "index" in val_counts.columns:
                index_name = "level_0"
            else:
                index_name = "index"
        val_counts.reset_index(inplace=True)
        val_counts = val_counts.sort_values(["count", index_name], ascending=False)
        val_counts.set_index(index_name, inplace=True)
        select_n = top_n
        if isinstance(top_n, dict):
            select_n = top_n.get(f.get_name(), DEFAULT_TOP_N)
        if drop_first:
            # Use select_n rather than top_n so a dict-valued top_n is handled
            select_n = min(len(val_counts), select_n)
            select_n = max(select_n - 1, 1)
        unique = val_counts.head(select_n).index.tolist()
        for label in unique:
            add = f == label
            add_name = add.get_name()
            new_feature_list.append(add)
            new_col = X[f.get_name()] == label
            new_col.rename(add_name, inplace=True)
            encoded_columns.append(new_col)

        if include_unknown:
            unknown = f.isin(unique).NOT().rename(f.get_name() + " is unknown")
            unknown_name = unknown.get_name()
            new_feature_list.append(unknown)
            new_col = ~X[f.get_name()].isin(unique)
            new_col.rename(unknown_name, inplace=True)
            encoded_columns.append(new_col)

        if inplace:
            X.drop(f.get_name(), axis=1, inplace=True)

    kept_columns.extend(pass_through)

    if inplace:
        for encoded_column in encoded_columns:
            X[encoded_column.name] = encoded_column
    else:
        X = pd.concat([X[kept_columns]] + encoded_columns, axis=1)

    entityset = new_feature_list[0].entityset
    ww_init_kwargs = get_ww_types_from_features(new_feature_list, entityset)

    # Grab ww metadata from feature matrix since it may be more exact
    for column in kept_columns:
        ww_init_kwargs["logical_types"][column] = columns_info[column].logical_type
        ww_init_kwargs["semantic_tags"][column] = columns_info[column].semantic_tags
        ww_init_kwargs["column_origins"][column] = columns_info[column].origin

    X.ww.init(**ww_init_kwargs)
    return X, new_feature_list


================================================
FILE: featuretools/synthesis/get_valid_primitives.py
================================================
from featuretools.primitives import AggregationPrimitive, TransformPrimitive
from featuretools.primitives.utils import (
    get_aggregation_primitives,
    get_transform_primitives,
)
from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis
from featuretools.synthesis.utils import _categorize_features, get_unused_primitives


def get_valid_primitives(
    entityset,
    target_dataframe_name,
    max_depth=2,
    selected_primitives=None,
    **dfs_kwargs,
):
    """
    Returns two lists of primitives (aggregation and transform) containing
    primitives that can be applied to the specific target dataframe to create
    features.  If the optional 'selected_primitives' parameter is not used,
    all discoverable primitives will be considered.

    Note:
        When using a ``max_depth`` greater than 1, some primitives returned by
        this function may not create any features if passed to DFS alone.  These
        primitives rely on features created by other primitives as input
        (primitive stacking).

    Args:
        entityset (EntitySet): An already initialized entityset
        target_dataframe_name (str): Name of dataframe to create features for.
        max_depth (int, optional): Maximum allowed depth of features.
        selected_primitives (list[str or AggregationPrimitive/TransformPrimitive], optional):
            List of primitives to consider when looking for valid primitives.
            If None, all primitives will be considered.
        dfs_kwargs (keywords): Additional keyword arguments to pass to
            the DeepFeatureSynthesis object. Should not include ``max_depth``, ``agg_primitives``,
            or ``trans_primitives``, as those are passed in explicitly.

    Returns:
        list[AggregationPrimitive], list[TransformPrimitive]:
            The list of valid aggregation primitives and the list of valid
            transform primitives.
    """
    agg_primitives = []
    trans_primitives = []
    available_aggs = get_aggregation_primitives()
    available_trans = get_transform_primitives()

    if selected_primitives:
        for prim in selected_primitives:
            if not isinstance(prim, str):
                if issubclass(prim, AggregationPrimitive):
                    prim_list = agg_primitives
                elif issubclass(prim, TransformPrimitive):
                    prim_list = trans_primitives
                else:
                    raise ValueError(
                        f"Selected primitive {prim} is not an "
                        "AggregationPrimitive, TransformPrimitive, or str",
                    )
            elif prim in available_aggs:
                prim = available_aggs[prim]
                prim_list = agg_primitives
            elif prim in available_trans:
                prim = available_trans[prim]
                prim_list = trans_primitives
            else:
                raise ValueError(f"'{prim}' is not a recognized primitive name")
            prim_list.append(prim)
    else:
        agg_primitives = list(available_aggs.values())
        trans_primitives = list(available_trans.values())

    dfs_object = DeepFeatureSynthesis(
        target_dataframe_name,
        entityset,
        agg_primitives=agg_primitives,
        trans_primitives=trans_primitives,
        max_depth=max_depth,
        **dfs_kwargs,
    )

    features = dfs_object.build_features()

    trans, agg, _, _ = _categorize_features(features)

    trans_unused = get_unused_primitives(trans_primitives, trans)
    agg_unused = get_unused_primitives(agg_primitives, agg)

    # switch from str to class
    agg_unused = [available_aggs[name] for name in agg_unused]
    trans_unused = [available_trans[name] for name in trans_unused]

    used_agg_prims = set(agg_primitives).difference(set(agg_unused))
    used_trans_prims = set(trans_primitives).difference(set(trans_unused))
    return list(used_agg_prims), list(used_trans_prims)


================================================
FILE: featuretools/synthesis/utils.py
================================================
from featuretools.feature_base import (
    AggregationFeature,
    FeatureOutputSlice,
    GroupByTransformFeature,
    TransformFeature,
)
from featuretools.utils.gen_utils import camel_and_title_to_snake


def _categorize_features(features):
    """Collect the names of the primitives used by the given features and their dependencies, grouped by primitive type (transform, aggregation, groupby transform, where)"""
    transform = set()
    agg = set()
    groupby = set()
    where = set()
    explored = set()

    def get_feature_data(feature):
        if feature.get_name() in explored:
            return

        dependencies = []

        if isinstance(feature, FeatureOutputSlice):
            feature = feature.base_feature

        if isinstance(feature, AggregationFeature):
            if feature.where:
                where.add(feature.primitive.name)
            else:
                agg.add(feature.primitive.name)
        elif isinstance(feature, GroupByTransformFeature):
            groupby.add(feature.primitive.name)
        elif isinstance(feature, TransformFeature):
            transform.add(feature.primitive.name)

        feature_deps = feature.get_dependencies()
        if feature_deps:
            dependencies.extend(feature_deps)

        explored.add(feature.get_name())

        for dep in dependencies:
            get_feature_data(dep)

    for feature in features:
        get_feature_data(feature)

    return transform, agg, groupby, where


def get_unused_primitives(specified, used):
    """Get a sorted list of the specified primitives whose names are not in the set of used primitive names"""
    if not specified:
        return []
    specified = {
        camel_and_title_to_snake(primitive)
        if isinstance(primitive, str)
        else primitive.name
        for primitive in specified
    }
    return sorted(specified.difference(used))


================================================
FILE: featuretools/tests/__init__.py
================================================


================================================
FILE: featuretools/tests/computational_backend/__init__.py
================================================


================================================
FILE: featuretools/tests/computational_backend/test_calculate_feature_matrix.py
================================================
import logging
import os
import re
import shutil
from datetime import datetime
from itertools import combinations
from random import randint

import numpy as np
import pandas as pd
import psutil
import pytest
from tqdm import tqdm
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import (
    Age,
    AgeNullable,
    Boolean,
    BooleanNullable,
    Integer,
    IntegerNullable,
)

from featuretools import (
    EntitySet,
    Feature,
    GroupByTransformFeature,
    Timedelta,
    calculate_feature_matrix,
    dfs,
)
from featuretools.computational_backends import utils
from featuretools.computational_backends.calculate_feature_matrix import (
    FEATURE_CALCULATION_PERCENTAGE,
    _chunk_dataframe_groups,
    _handle_chunk_size,
    scatter_warning,
)
from featuretools.computational_backends.utils import (
    bin_cutoff_times,
    create_client_and_cluster,
    n_jobs_to_workers,
)
from featuretools.feature_base import (
    AggregationFeature,
    DirectFeature,
    FeatureOutputSlice,
    IdentityFeature,
)
from featuretools.primitives import (
    Count,
    Max,
    Min,
    Negate,
    NMostCommon,
    Percentile,
    Sum,
    TransformPrimitive,
)
from featuretools.tests.testing_utils import (
    backward_path,
    get_mock_client_cluster,
)


def test_scatter_warning(caplog):
    logger = logging.getLogger("featuretools")
    match = "EntitySet was only scattered to {} out of {} workers"
    warning_message = match.format(1, 2)
    logger.propagate = True
    scatter_warning(1, 2)
    logger.propagate = False
    assert warning_message in caplog.text


def test_calc_feature_matrix(es):
    times = list(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [datetime(2011, 4, 9, 10, 40, 0)]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],
    )
    instances = range(17)
    cutoff_time = pd.DataFrame({"time": times, es["log"].ww.index: instances})
    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2

    property_feature = Feature(es["log"].ww["value"]) > 10

    feature_matrix = calculate_feature_matrix(
        [property_feature],
        es,
        cutoff_time=cutoff_time,
        verbose=True,
    )

    assert (feature_matrix[property_feature.get_name()] == labels).values.all()

    error_text = "features must be a non-empty list of features"
    with pytest.raises(AssertionError, match=error_text):
        feature_matrix = calculate_feature_matrix(
            "features",
            es,
            cutoff_time=cutoff_time,
        )

    with pytest.raises(AssertionError, match=error_text):
        feature_matrix = calculate_feature_matrix([], es, cutoff_time=cutoff_time)

    with pytest.raises(AssertionError, match=error_text):
        feature_matrix = calculate_feature_matrix(
            [1, 2, 3],
            es,
            cutoff_time=cutoff_time,
        )

    error_text = (
        "cutoff_time times must be datetime type: try casting via "
        "pd\\.to_datetime\\(\\)"
    )
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix(
            [property_feature],
            es,
            instance_ids=range(17),
            cutoff_time=17,
        )

    error_text = "cutoff_time must be a single value or DataFrame"
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix(
            [property_feature],
            es,
            instance_ids=range(17),
            cutoff_time=times,
        )

    cutoff_times_dup = pd.DataFrame(
        {
            "time": [datetime(2018, 3, 1), datetime(2018, 3, 1)],
            es["log"].ww.index: [1, 1],
        },
    )

    error_text = "Duplicated rows in cutoff time dataframe."
    with pytest.raises(AssertionError, match=error_text):
        feature_matrix = calculate_feature_matrix(
            [property_feature],
            entityset=es,
            cutoff_time=cutoff_times_dup,
        )

    cutoff_reordered = cutoff_time.iloc[[-1, 10, 1]]  # 3 ids not ordered by cutoff time
    feature_matrix = calculate_feature_matrix(
        [property_feature],
        es,
        cutoff_time=cutoff_reordered,
        verbose=True,
    )

    assert all(feature_matrix.index == cutoff_reordered["id"].values)


def test_cfm_compose(es, lt):
    property_feature = Feature(es["log"].ww["value"]) > 10

    feature_matrix = calculate_feature_matrix(
        [property_feature],
        es,
        cutoff_time=lt,
        verbose=True,
    )

    assert (
        feature_matrix[property_feature.get_name()] == feature_matrix["label_func"]
    ).values.all()


def test_cfm_compose_approximate(es, lt):
    property_feature = Feature(es["log"].ww["value"]) > 10

    feature_matrix = calculate_feature_matrix(
        [property_feature],
        es,
        cutoff_time=lt,
        approximate="1s",
        verbose=True,
    )
    assert isinstance(feature_matrix, pd.DataFrame)

    assert (
        feature_matrix[property_feature.get_name()] == feature_matrix["label_func"]
    ).values.all()


def test_cfm_approximate_correct_ordering():
    trips = {
        "trip_id": [i for i in range(1000)],
        "flight_time": [datetime(1998, 4, 2) for i in range(350)]
        + [datetime(1997, 4, 3) for i in range(650)],
        "flight_id": [randint(1, 25) for i in range(1000)],
        "trip_duration": [randint(1, 999) for i in range(1000)],
    }
    df = pd.DataFrame.from_dict(trips)
    es = EntitySet("flights")
    es.add_dataframe(
        dataframe_name="trips",
        dataframe=df,
        index="trip_id",
        time_index="flight_time",
    )
    es.normalize_dataframe(
        base_dataframe_name="trips",
        new_dataframe_name="flights",
        index="flight_id",
        make_time_index=True,
    )
    features = dfs(entityset=es, target_dataframe_name="trips", features_only=True)
    flight_features = [
        feature
        for feature in features
        if isinstance(feature, DirectFeature)
        and isinstance(feature.base_features[0], AggregationFeature)
    ]
    property_feature = IdentityFeature(es["trips"].ww["trip_id"])

    cutoff_time = pd.DataFrame.from_dict(
        {"instance_id": df["trip_id"], "time": df["flight_time"]},
    )
    time_feature = IdentityFeature(es["trips"].ww["flight_time"])
    feature_matrix = calculate_feature_matrix(
        flight_features + [property_feature, time_feature],
        es,
        cutoff_time_in_index=True,
        cutoff_time=cutoff_time,
    )
    feature_matrix.index.names = ["instance", "time"]
    assert np.all(
        feature_matrix.reset_index("time").reset_index()[["instance", "time"]].values
        == feature_matrix[["trip_id", "flight_time"]].values,
    )
    feature_matrix_2 = calculate_feature_matrix(
        flight_features + [property_feature, time_feature],
        es,
        cutoff_time=cutoff_time,
        cutoff_time_in_index=True,
        approximate=Timedelta(2, "d"),
    )
    feature_matrix_2.index.names = ["instance", "time"]
    assert np.all(
        feature_matrix_2.reset_index("time").reset_index()[["instance", "time"]].values
        == feature_matrix_2[["trip_id", "flight_time"]].values,
    )
    for column in feature_matrix:
        for x, y in zip(feature_matrix[column], feature_matrix_2[column]):
            assert (pd.isnull(x) and pd.isnull(y)) or (x == y)


def test_cfm_no_cutoff_time_index(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    agg_feat4 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum)
    dfeat = DirectFeature(agg_feat4, "sessions")
    cutoff_time = pd.DataFrame(
        {
            "time": [datetime(2013, 4, 9, 10, 31, 19), datetime(2013, 4, 9, 11, 0, 0)],
            "instance_id": [0, 2],
        },
    )
    feature_matrix = calculate_feature_matrix(
        [dfeat, agg_feat],
        es,
        cutoff_time_in_index=False,
        approximate=Timedelta(12, "s"),
        cutoff_time=cutoff_time,
    )
    assert feature_matrix.index.name == "id"
    assert feature_matrix.index.tolist() == [0, 2]
    assert feature_matrix[dfeat.get_name()].tolist() == [10, 10]
    assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1]

    cutoff_time = pd.DataFrame(
        {
            "time": [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)],
            "instance_id": [0, 2],
        },
    )
    feature_matrix_2 = calculate_feature_matrix(
        [dfeat, agg_feat],
        es,
        cutoff_time_in_index=False,
        approximate=Timedelta(10, "s"),
        cutoff_time=cutoff_time,
    )
    assert feature_matrix_2.index.name == "id"
    assert feature_matrix_2.index.tolist() == [0, 2]
    assert feature_matrix_2[dfeat.get_name()].tolist() == [7, 10]
    assert feature_matrix_2[agg_feat.get_name()].tolist() == [5, 1]


def test_cfm_duplicated_index_in_cutoff_time(es):
    times = [
        datetime(2011, 4, 1),
        datetime(2011, 5, 1),
        datetime(2011, 4, 1),
        datetime(2011, 5, 1),
    ]

    instances = [1, 1, 2, 2]
    property_feature = Feature(es["log"].ww["value"]) > 10
    cutoff_time = pd.DataFrame({"id": instances, "time": times}, index=[1, 1, 1, 1])

    feature_matrix = calculate_feature_matrix(
        [property_feature],
        es,
        cutoff_time=cutoff_time,
        chunk_size=1,
    )
    assert feature_matrix.shape[0] == cutoff_time.shape[0]


def test_saveprogress(es, tmp_path):
    times = list(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [datetime(2011, 4, 9, 10, 40, 0)]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],
    )
    cutoff_time = pd.DataFrame({"time": times, "instance_id": range(17)})
    property_feature = Feature(es["log"].ww["value"]) > 10
    save_progress = str(tmp_path)
    fm_save = calculate_feature_matrix(
        [property_feature],
        es,
        cutoff_time=cutoff_time,
        save_progress=save_progress,
    )
    _, _, files = next(os.walk(save_progress))
    files = [os.path.join(save_progress, file) for file in files]
    # there are 17 datetime files created above
    assert len(files) == 17
    list_df = []
    for file_ in files:
        df = pd.read_csv(file_, index_col="id", header=0)
        list_df.append(df)
    merged_df = pd.concat(list_df)
    merged_df.set_index(pd.DatetimeIndex(times), inplace=True, append=True)
    fm_no_save = calculate_feature_matrix(
        [property_feature],
        es,
        cutoff_time=cutoff_time,
    )
    assert np.all((merged_df.sort_index().values) == (fm_save.sort_index().values))
    assert np.all((fm_no_save.sort_index().values) == (fm_save.sort_index().values))
    assert np.all((fm_no_save.sort_index().values) == (merged_df.sort_index().values))
    shutil.rmtree(save_progress)


def test_cutoff_time_correctly(es):
    property_feature = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    times = [datetime(2011, 4, 10), datetime(2011, 4, 11), datetime(2011, 4, 7)]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 1, 2]})
    feature_matrix = calculate_feature_matrix(
        [property_feature],
        es,
        cutoff_time=cutoff_time,
    )
    labels = [10, 5, 0]
    assert (feature_matrix[property_feature.get_name()] == labels).values.all()


def test_cutoff_time_binning():
    cutoff_time = pd.DataFrame(
        {
            "time": [
                datetime(2011, 4, 9, 12, 31),
                datetime(2011, 4, 10, 11),
                datetime(2011, 4, 10, 13, 10, 1),
            ],
            "instance_id": [1, 2, 3],
        },
    )
    cutoff_time.ww.init()
    binned_cutoff_times = bin_cutoff_times(cutoff_time, Timedelta(4, "h"))
    labels = [
        datetime(2011, 4, 9, 12),
        datetime(2011, 4, 10, 8),
        datetime(2011, 4, 10, 12),
    ]
    for i in binned_cutoff_times.index:
        assert binned_cutoff_times["time"][i] == labels[i]

    binned_cutoff_times = bin_cutoff_times(cutoff_time, Timedelta(25, "h"))
    labels = [
        datetime(2011, 4, 8, 22),
        datetime(2011, 4, 9, 23),
        datetime(2011, 4, 9, 23),
    ]
    for i in binned_cutoff_times.index:
        assert binned_cutoff_times["time"][i] == labels[i]

    error_text = "Unit is relative"
    with pytest.raises(ValueError, match=error_text):
        binned_cutoff_times = bin_cutoff_times(cutoff_time, Timedelta(1, "mo"))


def test_cutoff_time_columns_order(es):
    property_feature = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    times = [datetime(2011, 4, 10), datetime(2011, 4, 11), datetime(2011, 4, 7)]
    id_col_names = ["instance_id", es["customers"].ww.index]
    time_col_names = ["time", es["customers"].ww.time_index]
    for id_col in id_col_names:
        for time_col in time_col_names:
            cutoff_time = pd.DataFrame(
                {
                    "dummy_col_1": [1, 2, 3],
                    id_col: [0, 1, 2],
                    "dummy_col_2": [True, False, False],
                    time_col: times,
                },
            )
            feature_matrix = calculate_feature_matrix(
                [property_feature],
                es,
                cutoff_time=cutoff_time,
            )

            labels = [10, 5, 0]
            assert (feature_matrix[property_feature.get_name()] == labels).values.all()


def test_cutoff_time_df_redundant_column_names(es):
    property_feature = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    times = [datetime(2011, 4, 10), datetime(2011, 4, 11), datetime(2011, 4, 7)]

    cutoff_time = pd.DataFrame(
        {
            es["customers"].ww.index: [0, 1, 2],
            "instance_id": [0, 1, 2],
            "dummy_col": [True, False, False],
            "time": times,
        },
    )
    err_msg = (
        'Cutoff time DataFrame cannot contain both a column named "instance_id" and a column'
        " with the same name as the target dataframe index"
    )
    with pytest.raises(AttributeError, match=err_msg):
        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_time)

    cutoff_time = pd.DataFrame(
        {
            es["customers"].ww.time_index: [0, 1, 2],
            "instance_id": [0, 1, 2],
            "dummy_col": [True, False, False],
            "time": times,
        },
    )
    err_msg = (
        'Cutoff time DataFrame cannot contain both a column named "time" and a column'
        " with the same name as the target dataframe time index"
    )
    with pytest.raises(AttributeError, match=err_msg):
        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_time)


def test_training_window(es):
    property_feature = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    top_level_agg = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )

    # include a direct feature of a higher-level aggregation so that there are
    # multiple "filter eids" in get_pandas_data_slice, and the loop that pulls
    # data with a training_window param runs more than once
    dagg = DirectFeature(top_level_agg, "customers")

    # for now, a warning is raised if last_time_index is not set
    times = [
        datetime(2011, 4, 9, 12, 31),
        datetime(2011, 4, 10, 11),
        datetime(2011, 4, 10, 13, 10),
    ]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 1, 2]})
    warn_text = (
        "Using training_window but last_time_index is not set for dataframe customers"
    )
    with pytest.warns(UserWarning, match=warn_text):
        feature_matrix = calculate_feature_matrix(
            [property_feature, dagg],
            es,
            cutoff_time=cutoff_time,
            training_window="2 hours",
        )

    es.add_last_time_indexes()

    error_text = "Training window cannot be in observations"
    with pytest.raises(AssertionError, match=error_text):
        feature_matrix = calculate_feature_matrix(
            [property_feature],
            es,
            cutoff_time=cutoff_time,
            training_window=Timedelta(2, "observations"),
        )

    # Case1. include_cutoff_time = True
    feature_matrix = calculate_feature_matrix(
        [property_feature, dagg],
        es,
        cutoff_time=cutoff_time,
        training_window="2 hours",
        include_cutoff_time=True,
    )
    prop_values = [4, 5, 1]
    dagg_values = [3, 2, 1]
    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()
    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()

    # Case2. include_cutoff_time = False
    feature_matrix = calculate_feature_matrix(
        [property_feature, dagg],
        es,
        cutoff_time=cutoff_time,
        training_window="2 hours",
        include_cutoff_time=False,
    )
    prop_values = [5, 5, 2]
    dagg_values = [3, 2, 1]

    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()
    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()

    # Case3. include_cutoff_time = False with single cutoff time value
    feature_matrix = calculate_feature_matrix(
        [property_feature, dagg],
        es,
        cutoff_time=pd.to_datetime("2011-04-09 10:40:00"),
        training_window="9 minutes",
        include_cutoff_time=False,
    )
    prop_values = [0, 4, 0]
    dagg_values = [3, 3, 3]
    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()
    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()

    # Case4. include_cutoff_time = True with single cutoff time value
    feature_matrix = calculate_feature_matrix(
        [property_feature, dagg],
        es,
        cutoff_time=pd.to_datetime("2011-04-10 10:40:00"),
        training_window="2 days",
        include_cutoff_time=True,
    )
    prop_values = [0, 10, 1]
    dagg_values = [3, 3, 3]
    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()
    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()


def test_training_window_overlap(es):
    es.add_last_time_indexes()

    count_log = Feature(
        Feature(es["log"].ww["id"]),
        parent_dataframe_name="customers",
        primitive=Count,
    )

    cutoff_time = pd.DataFrame(
        {
            "id": [0, 0],
            "time": ["2011-04-09 10:30:00", "2011-04-09 10:40:00"],
        },
    ).astype({"time": "datetime64[ns]"})

    # Case1. include_cutoff_time = True
    actual = calculate_feature_matrix(
        features=[count_log],
        entityset=es,
        cutoff_time=cutoff_time,
        cutoff_time_in_index=True,
        training_window="10 minutes",
        include_cutoff_time=True,
    )
    actual = actual["COUNT(log)"]
    np.testing.assert_array_equal(actual.values, [1, 9])

    # Case2. include_cutoff_time = False
    actual = calculate_feature_matrix(
        features=[count_log],
        entityset=es,
        cutoff_time=cutoff_time,
        cutoff_time_in_index=True,
        training_window="10 minutes",
        include_cutoff_time=False,
    )
    actual = actual["COUNT(log)"]
    np.testing.assert_array_equal(actual.values, [0, 9])


def test_include_cutoff_time_without_training_window(es):
    es.add_last_time_indexes()

    count_log = Feature(
        base=Feature(es["log"].ww["id"]),
        parent_dataframe_name="customers",
        primitive=Count,
    )

    cutoff_time = pd.DataFrame(
        {
            "id": [0, 0],
            "time": ["2011-04-09 10:30:00", "2011-04-09 10:31:00"],
        },
    ).astype({"time": "datetime64[ns]"})

    # Case1. include_cutoff_time = True
    actual = calculate_feature_matrix(
        features=[count_log],
        entityset=es,
        cutoff_time=cutoff_time,
        cutoff_time_in_index=True,
        include_cutoff_time=True,
    )
    actual = actual["COUNT(log)"]
    np.testing.assert_array_equal(actual.values, [1, 6])

    # Case2. include_cutoff_time = False
    actual = calculate_feature_matrix(
        features=[count_log],
        entityset=es,
        cutoff_time=cutoff_time,
        cutoff_time_in_index=True,
        include_cutoff_time=False,
    )
    actual = actual["COUNT(log)"]
    np.testing.assert_array_equal(actual.values, [0, 5])

    # Case3. include_cutoff_time = True with single cutoff time value
    actual = calculate_feature_matrix(
        features=[count_log],
        entityset=es,
        cutoff_time=pd.to_datetime("2011-04-09 10:31:00"),
        instance_ids=[0],
        cutoff_time_in_index=True,
        include_cutoff_time=True,
    )
    actual = actual["COUNT(log)"]
    np.testing.assert_array_equal(actual.values, [6])

    # Case4. include_cutoff_time = False with single cutoff time value
    actual = calculate_feature_matrix(
        features=[count_log],
        entityset=es,
        cutoff_time=pd.to_datetime("2011-04-09 10:31:00"),
        instance_ids=[0],
        cutoff_time_in_index=True,
        include_cutoff_time=False,
    )
    actual = actual["COUNT(log)"]
    np.testing.assert_array_equal(actual.values, [5])


def test_approximate_dfeat_of_agg_on_target_include_cutoff_time(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum)
    dfeat = DirectFeature(agg_feat2, "sessions")

    cutoff_time = pd.DataFrame(
        {"time": [datetime(2011, 4, 9, 10, 31, 19)], "instance_id": [0]},
    )
    feature_matrix = calculate_feature_matrix(
        [dfeat, agg_feat2, agg_feat],
        es,
        approximate=Timedelta(20, "s"),
        cutoff_time=cutoff_time,
        include_cutoff_time=False,
    )

    # binned cutoff_time will be datetime(2011, 4, 9, 10, 31, 0), so
    # log event 5 at datetime(2011, 4, 9, 10, 31, 0) falls exactly on the
    # approximate cutoff time and is excluded when include_cutoff_time=False
    assert feature_matrix[dfeat.get_name()].tolist() == [5]
    assert feature_matrix[agg_feat.get_name()].tolist() == [5]

    feature_matrix = calculate_feature_matrix(
        [dfeat, agg_feat],
        es,
        approximate=Timedelta(20, "s"),
        cutoff_time=cutoff_time,
        include_cutoff_time=True,
    )

    # binned cutoff_time will be datetime(2011, 4, 9, 10, 31, 0), so
    # log event 5 at datetime(2011, 4, 9, 10, 31, 0) falls exactly on the
    # approximate cutoff time and is included when include_cutoff_time=True
    assert feature_matrix[dfeat.get_name()].tolist() == [6]
    assert feature_matrix[agg_feat.get_name()].tolist() == [5]


def test_training_window_recent_time_index(es):
    # customer with no sessions
    row = {
        "id": [3],
        "age": [73],
        "région_id": ["United States"],
        "cohort": [1],
        "cancel_reason": ["Lost interest"],
        "loves_ice_cream": [True],
        "favorite_quote": ["Don't look back. Something might be gaining on you."],
        "signup_date": [datetime(2011, 4, 10)],
        "upgrade_date": [datetime(2011, 4, 12)],
        "cancel_date": [datetime(2011, 5, 13)],
        "birthday": [datetime(1938, 2, 1)],
        "engagement_level": [2],
    }
    to_add_df = pd.DataFrame(row)
    to_add_df.index = range(3, 4)

    # have to convert category to int in order to concat
    old_df = es["customers"]
    old_df.index = old_df.index.astype("int")
    old_df["id"] = old_df["id"].astype(int)

    df = pd.concat([old_df, to_add_df], sort=True)

    # convert back after
    df.index = df.index.astype("category")
    df["id"] = df["id"].astype("category")

    es.replace_dataframe(
        dataframe_name="customers",
        df=df,
        recalculate_last_time_indexes=False,
    )
    es.add_last_time_indexes()

    property_feature = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    top_level_agg = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )
    dagg = DirectFeature(top_level_agg, "customers")
    instance_ids = [0, 1, 2, 3]
    times = [
        datetime(2011, 4, 9, 12, 31),
        datetime(2011, 4, 10, 11),
        datetime(2011, 4, 10, 13, 10, 1),
        datetime(2011, 4, 10, 1, 59, 59),
    ]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": instance_ids})

    # Case1. include_cutoff_time = True
    feature_matrix = calculate_feature_matrix(
        [property_feature, dagg],
        es,
        cutoff_time=cutoff_time,
        training_window="2 hours",
        include_cutoff_time=True,
    )
    prop_values = [4, 5, 1, 0]
    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()

    dagg_values = [3, 2, 1, 3]
    feature_matrix.sort_index(inplace=True)
    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()

    # Case2. include_cutoff_time = False
    feature_matrix = calculate_feature_matrix(
        [property_feature, dagg],
        es,
        cutoff_time=cutoff_time,
        training_window="2 hours",
        include_cutoff_time=False,
    )
    prop_values = [5, 5, 1, 0]
    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()

    dagg_values = [3, 2, 1, 3]
    feature_matrix.sort_index(inplace=True)
    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()


def test_approximate_multiple_instances_per_cutoff_time(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum)
    dfeat = DirectFeature(agg_feat2, "sessions")
    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]})
    feature_matrix = calculate_feature_matrix(
        [dfeat, agg_feat],
        es,
        approximate=Timedelta(1, "week"),
        cutoff_time=cutoff_time,
    )
    assert feature_matrix.shape[0] == 2
    assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1]


def test_approximate_with_multiple_paths(diamond_es):
    es = diamond_es
    path = backward_path(es, ["regions", "customers", "transactions"])
    agg_feat = AggregationFeature(
        Feature(es["transactions"].ww["id"]),
        parent_dataframe_name="regions",
        relationship_path=path,
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "customers")
    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]})
    feature_matrix = calculate_feature_matrix(
        [dfeat],
        es,
        approximate=Timedelta(1, "week"),
        cutoff_time=cutoff_time,
    )
    assert feature_matrix[dfeat.get_name()].tolist() == [6, 2]


def test_approximate_dfeat_of_agg_on_target(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum)
    dfeat = DirectFeature(agg_feat2, "sessions")
    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]})
    feature_matrix = calculate_feature_matrix(
        [dfeat, agg_feat],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time=cutoff_time,
    )
    assert feature_matrix[dfeat.get_name()].tolist() == [7, 10]
    assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1]


def test_approximate_dfeat_of_need_all_values(es):
    p = Feature(es["log"].ww["value"], primitive=Percentile)
    agg_feat = Feature(p, parent_dataframe_name="sessions", primitive=Sum)
    agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum)
    dfeat = DirectFeature(agg_feat2, "sessions")
    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]})
    feature_matrix = calculate_feature_matrix(
        [dfeat, agg_feat],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time_in_index=True,
        cutoff_time=cutoff_time,
    )
    log_df = es["log"]
    instances = [0, 2]
    cutoffs = [pd.Timestamp("2011-04-09 10:31:19"), pd.Timestamp("2011-04-09 11:00:00")]
    approxes = [
        pd.Timestamp("2011-04-09 10:31:10"),
        pd.Timestamp("2011-04-09 11:00:00"),
    ]
    true_vals = []
    true_vals_approx = []
    for instance, cutoff, approx in zip(instances, cutoffs, approxes):
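        # copy the filtered slice so the column assignment below does not
        # write to a view of log_df
        log_data_cutoff = log_df[log_df["datetime"] < cutoff].copy()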
        log_data_cutoff["percentile"] = log_data_cutoff["value"].rank(pct=True)
        true_agg = (
            log_data_cutoff.loc[log_data_cutoff["session_id"] == instance, "percentile"]
            .fillna(0)
            .sum()
        )
        true_vals.append(round(true_agg, 3))

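        # copy the filtered slice so the column assignment below does not
        # write to a view of log_df
        log_data_approx = log_df[log_df["datetime"] < approx].copy()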
        log_data_approx["percentile"] = log_data_approx["value"].rank(pct=True)
        true_agg_approx = (
            log_data_approx.loc[
                log_data_approx["session_id"].isin([0, 1, 2]),
                "percentile",
            ]
            .fillna(0)
            .sum()
        )
        true_vals_approx.append(round(true_agg_approx, 3))
    lapprox = [round(x, 3) for x in feature_matrix[dfeat.get_name()].tolist()]
    test_list = [round(x, 3) for x in feature_matrix[agg_feat.get_name()].tolist()]
    assert lapprox == true_vals_approx
    assert test_list == true_vals


def test_uses_full_dataframe_feat_of_approximate(es):
    agg_feat = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        primitive=Sum,
    )
    agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum)
    agg_feat3 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Max)
    dfeat = DirectFeature(agg_feat2, "sessions")
    dfeat2 = DirectFeature(agg_feat3, "sessions")
    p = Feature(dfeat, primitive=Percentile)
    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]})
    # only dfeat2 should be approximated, because dfeat feeds Percentile,
    # which needs all values

    feature_matrix_only_dfeat2 = calculate_feature_matrix(
        [dfeat2],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time_in_index=True,
        cutoff_time=cutoff_time,
    )
    assert feature_matrix_only_dfeat2[dfeat2.get_name()].tolist() == [50, 50]

    feature_matrix_approx = calculate_feature_matrix(
        [p, dfeat, dfeat2, agg_feat],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time_in_index=True,
        cutoff_time=cutoff_time,
    )
    assert (
        feature_matrix_only_dfeat2[dfeat2.get_name()].tolist()
        == feature_matrix_approx[dfeat2.get_name()].tolist()
    )

    feature_matrix_small_approx = calculate_feature_matrix(
        [p, dfeat, dfeat2, agg_feat],
        es,
        approximate=Timedelta(10, "ms"),
        cutoff_time_in_index=True,
        cutoff_time=cutoff_time,
    )

    feature_matrix_no_approx = calculate_feature_matrix(
        [p, dfeat, dfeat2, agg_feat],
        es,
        cutoff_time_in_index=True,
        cutoff_time=cutoff_time,
    )
    for f in [p, dfeat, agg_feat]:
        for fm1, fm2 in combinations(
            [
                feature_matrix_approx,
                feature_matrix_small_approx,
                feature_matrix_no_approx,
            ],
            2,
        ):
            assert fm1[f.get_name()].tolist() == fm2[f.get_name()].tolist()


def test_approximate_dfeat_of_dfeat_of_agg_on_target(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum)
    dfeat = DirectFeature(Feature(agg_feat2, "sessions"), "log")
    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]})
    feature_matrix = calculate_feature_matrix(
        [dfeat],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time=cutoff_time,
    )
    assert feature_matrix[dfeat.get_name()].tolist() == [7, 10]


def test_empty_path_approximate_full(es):
    es["sessions"].ww["customer_id"] = pd.Series(
        [np.nan, np.nan, np.nan, 1, 1, 2],
        dtype="category",
    )
    # Need to reassign the `foreign_key` tag as the column reassignment above removes it
    es["sessions"].ww.set_types(semantic_tags={"customer_id": "foreign_key"})
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum)
    dfeat = DirectFeature(agg_feat2, "sessions")
    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]})
    feature_matrix = calculate_feature_matrix(
        [dfeat, agg_feat],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time=cutoff_time,
    )
    vals1 = feature_matrix[dfeat.get_name()].tolist()

    assert vals1[0] == 0
    assert vals1[1] == 0
    assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1]


def test_approx_base_feature_is_also_first_class_feature(es):
    log_to_products = DirectFeature(Feature(es["products"].ww["rating"]), "log")
    # This should still be computed properly
    agg_feat = Feature(log_to_products, parent_dataframe_name="sessions", primitive=Min)
    customer_agg_feat = Feature(
        agg_feat,
        parent_dataframe_name="customers",
        primitive=Sum,
    )
    # This is to be approximated
    sess_to_cust = DirectFeature(customer_agg_feat, "sessions")
    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]
    cutoff_time = pd.DataFrame({"time": times, "instance_id": [0, 2]})
    feature_matrix = calculate_feature_matrix(
        [sess_to_cust, agg_feat],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time=cutoff_time,
    )

    vals1 = feature_matrix[sess_to_cust.get_name()].tolist()
    assert vals1 == [8.5, 7]
    vals2 = feature_matrix[agg_feat.get_name()].tolist()
    assert vals2 == [4, 1.5]


def test_approximate_time_split_returns_the_same_result(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    agg_feat2 = Feature(agg_feat, parent_dataframe_name="customers", primitive=Sum)
    dfeat = DirectFeature(agg_feat2, "sessions")

    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-09 10:07:30"),
                pd.Timestamp("2011-04-09 10:07:40"),
            ],
            "instance_id": [0, 0],
        },
    )

    feature_matrix_at_once = calculate_feature_matrix(
        [dfeat, agg_feat],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time=cutoff_df,
    )
    divided_matrices = []
    separate_cutoff = [cutoff_df.iloc[0:1], cutoff_df.iloc[1:]]
    # Give the two cutoff frames different indexes.
    # This step is not required; it is only done to showcase the issue here
    separate_cutoff[0].index = [0]
    separate_cutoff[1].index = [1]
    for ct in separate_cutoff:
        fm = calculate_feature_matrix(
            [dfeat, agg_feat],
            es,
            approximate=Timedelta(10, "s"),
            cutoff_time=ct,
        )
        divided_matrices.append(fm)
    feature_matrix_from_split = pd.concat(divided_matrices)
    assert feature_matrix_from_split.shape == feature_matrix_at_once.shape
    for i1, i2 in zip(feature_matrix_at_once.index, feature_matrix_from_split.index):
        assert (pd.isnull(i1) and pd.isnull(i2)) or (i1 == i2)
    for c in feature_matrix_from_split:
        for i1, i2 in zip(feature_matrix_at_once[c], feature_matrix_from_split[c]):
            assert (pd.isnull(i1) and pd.isnull(i2)) or (i1 == i2)


def test_approximate_returns_correct_empty_default_values(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "sessions")

    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-08 11:00:00"),
                pd.Timestamp("2011-04-09 11:00:00"),
            ],
            "instance_id": [0, 0],
        },
    )

    fm = calculate_feature_matrix(
        [dfeat],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time=cutoff_df,
    )
    assert fm[dfeat.get_name()].tolist() == [0, 10]


def test_approximate_child_aggs_handled_correctly(es):
    agg_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "customers")
    agg_feat_2 = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Sum,
    )
    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-08 10:30:00"),
                pd.Timestamp("2011-04-09 10:30:06"),
            ],
            "instance_id": [0, 0],
        },
    )

    fm = calculate_feature_matrix(
        [dfeat],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time=cutoff_df,
    )
    fm_2 = calculate_feature_matrix(
        [dfeat, agg_feat_2],
        es,
        approximate=Timedelta(10, "s"),
        cutoff_time=cutoff_df,
    )
    assert fm[dfeat.get_name()].tolist() == [2, 3]
    assert fm_2[agg_feat_2.get_name()].tolist() == [0, 5]


def test_cutoff_time_naming(es):
    agg_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "customers")
    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-08 10:30:00"),
                pd.Timestamp("2011-04-09 10:30:06"),
            ],
            "instance_id": [0, 0],
        },
    )
    cutoff_df_index_name = cutoff_df.rename(columns={"instance_id": "id"})
    cutoff_df_wrong_index_name = cutoff_df.rename(columns={"instance_id": "wrong_id"})
    cutoff_df_wrong_time_name = cutoff_df.rename(columns={"time": "cutoff_time"})

    fm1 = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df)
    fm2 = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df_index_name)
    assert all((fm1 == fm2.values).values)

    error_text = (
        "Cutoff time DataFrame must contain a column with either the same name"
        ' as the target dataframe index or a column named "instance_id"'
    )
    with pytest.raises(AttributeError, match=error_text):
        calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df_wrong_index_name)

    time_error_text = (
        "Cutoff time DataFrame must contain a column with either the same name"
        ' as the target dataframe time_index or a column named "time"'
    )
    with pytest.raises(AttributeError, match=time_error_text):
        calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df_wrong_time_name)


def test_cutoff_time_extra_columns(es):
    agg_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "customers")

    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-09 10:30:06"),
                pd.Timestamp("2011-04-09 10:30:03"),
                pd.Timestamp("2011-04-08 10:30:00"),
            ],
            "instance_id": [0, 1, 0],
            "label": [True, True, False],
        },
        columns=["time", "instance_id", "label"],
    )
    fm = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df)
    # check column was added to end of matrix
    assert "label" == fm.columns[-1]

    assert (fm["label"].values == cutoff_df["label"].values).all()


def test_cutoff_time_extra_columns_approximate(es):
    agg_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "customers")

    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-09 10:30:06"),
                pd.Timestamp("2011-04-09 10:30:03"),
                pd.Timestamp("2011-04-08 10:30:00"),
            ],
            "instance_id": [0, 1, 0],
            "label": [True, True, False],
        },
        columns=["time", "instance_id", "label"],
    )
    fm = calculate_feature_matrix(
        [dfeat],
        es,
        cutoff_time=cutoff_df,
        approximate="2 days",
    )
    # check column is present in the matrix
    assert "label" in fm.columns

    assert (fm["label"].values == cutoff_df["label"].values).all()


def test_cutoff_time_extra_columns_same_name(es):
    agg_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "customers")

    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-09 10:30:06"),
                pd.Timestamp("2011-04-09 10:30:03"),
                pd.Timestamp("2011-04-08 10:30:00"),
            ],
            "instance_id": [0, 1, 0],
            "régions.COUNT(customers)": [False, False, True],
        },
        columns=["time", "instance_id", "régions.COUNT(customers)"],
    )
    fm = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df)

    assert (
        fm["régions.COUNT(customers)"].values
        == cutoff_df["régions.COUNT(customers)"].values
    ).all()


def test_cutoff_time_extra_columns_same_name_approximate(es):
    agg_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "customers")

    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-09 10:30:06"),
                pd.Timestamp("2011-04-09 10:30:03"),
                pd.Timestamp("2011-04-08 10:30:00"),
            ],
            "instance_id": [0, 1, 0],
            "régions.COUNT(customers)": [False, False, True],
        },
        columns=["time", "instance_id", "régions.COUNT(customers)"],
    )
    fm = calculate_feature_matrix(
        [dfeat],
        es,
        cutoff_time=cutoff_df,
        approximate="2 days",
    )

    assert (
        fm["régions.COUNT(customers)"].values
        == cutoff_df["régions.COUNT(customers)"].values
    ).all()


def test_instances_after_cutoff_time_removed(es):
    property_feature = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    cutoff_time = datetime(2011, 4, 8)
    fm = calculate_feature_matrix(
        [property_feature],
        es,
        cutoff_time=cutoff_time,
        cutoff_time_in_index=True,
    )
    actual_ids = (
        [id for (id, _) in fm.index]
        if isinstance(fm.index, pd.MultiIndex)
        else fm.index
    )

    # Customer with id 1 should be removed
    assert set(actual_ids) == {0, 2}


def test_instances_with_id_kept_after_cutoff(es):
    property_feature = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    cutoff_time = datetime(2011, 4, 8)
    fm = calculate_feature_matrix(
        [property_feature],
        es,
        instance_ids=[0, 1, 2],
        cutoff_time=cutoff_time,
        cutoff_time_in_index=True,
    )

    # Customer #1 is after cutoff, but since it is included in instance_ids it
    # should be kept.
    actual_ids = (
        [id for (id, _) in fm.index]
        if isinstance(fm.index, pd.MultiIndex)
        else fm.index
    )
    assert set(actual_ids) == {0, 1, 2}


def test_cfm_returns_original_time_indexes(es):
    agg_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "customers")
    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-09 10:30:06"),
                pd.Timestamp("2011-04-09 10:30:03"),
                pd.Timestamp("2011-04-08 10:30:00"),
            ],
            "instance_id": [0, 1, 0],
        },
    )

    fm = calculate_feature_matrix(
        [dfeat],
        es,
        cutoff_time=cutoff_df,
        cutoff_time_in_index=True,
    )

    instance_level_vals = fm.index.get_level_values(0).values
    time_level_vals = fm.index.get_level_values(1).values

    assert (instance_level_vals == cutoff_df["instance_id"].values).all()
    assert (time_level_vals == cutoff_df["time"].values).all()


def test_cfm_returns_original_time_indexes_approximate(es):
    agg_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )
    dfeat = DirectFeature(agg_feat, "customers")
    agg_feat_2 = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    cutoff_df = pd.DataFrame(
        {
            "time": [
                pd.Timestamp("2011-04-09 10:30:06"),
                pd.Timestamp("2011-04-09 10:30:03"),
                pd.Timestamp("2011-04-08 10:30:00"),
            ],
            "instance_id": [0, 1, 0],
        },
    )
    # approximate, in different windows, no unapproximated aggs
    fm = calculate_feature_matrix(
        [dfeat],
        es,
        cutoff_time=cutoff_df,
        cutoff_time_in_index=True,
        approximate="1 m",
    )
    instance_level_vals = fm.index.get_level_values(0).values
    time_level_vals = fm.index.get_level_values(1).values
    assert (instance_level_vals == cutoff_df["instance_id"].values).all()
    assert (time_level_vals == cutoff_df["time"].values).all()

    # approximate, in different windows, unapproximated aggs
    fm = calculate_feature_matrix(
        [dfeat, agg_feat_2],
        es,
        cutoff_time=cutoff_df,
        cutoff_time_in_index=True,
        approximate="1 m",
    )
    instance_level_vals = fm.index.get_level_values(0).values
    time_level_vals = fm.index.get_level_values(1).values
    assert (instance_level_vals == cutoff_df["instance_id"].values).all()
    assert (time_level_vals == cutoff_df["time"].values).all()

    # approximate, in same window, no unapproximated aggs
    fm2 = calculate_feature_matrix(
        [dfeat],
        es,
        cutoff_time=cutoff_df,
        cutoff_time_in_index=True,
        approximate="2 d",
    )
    instance_level_vals = fm2.index.get_level_values(0).values
    time_level_vals = fm2.index.get_level_values(1).values
    assert (instance_level_vals == cutoff_df["instance_id"].values).all()
    assert (time_level_vals == cutoff_df["time"].values).all()

    # approximate, in same window, unapproximated aggs
    fm3 = calculate_feature_matrix(
        [dfeat, agg_feat_2],
        es,
        cutoff_time=cutoff_df,
        cutoff_time_in_index=True,
        approximate="2 d",
    )
    instance_level_vals = fm3.index.get_level_values(0).values
    time_level_vals = fm3.index.get_level_values(1).values
    assert (instance_level_vals == cutoff_df["instance_id"].values).all()
    assert (time_level_vals == cutoff_df["time"].values).all()


def test_dask_kwargs(es, dask_cluster):
    times = (
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [datetime(2011, 4, 9, 10, 40, 0)]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)]
    )
    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2
    cutoff_time = pd.DataFrame({"time": times, "instance_id": range(17)})
    property_feature = IdentityFeature(es["log"].ww["value"]) > 10

    dkwargs = {"cluster": dask_cluster.scheduler.address}
    feature_matrix = calculate_feature_matrix(
        [property_feature],
        entityset=es,
        cutoff_time=cutoff_time,
        verbose=True,
        chunk_size=0.13,
        dask_kwargs=dkwargs,
        approximate="1 hour",
    )

    assert (feature_matrix[property_feature.get_name()] == labels).values.all()


def test_dask_persisted_es(es, capsys, dask_cluster):
    times = (
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [datetime(2011, 4, 9, 10, 40, 0)]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)]
    )
    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2
    cutoff_time = pd.DataFrame({"time": times, "instance_id": range(17)})
    property_feature = IdentityFeature(es["log"].ww["value"]) > 10

    dkwargs = {"cluster": dask_cluster.scheduler.address}
    feature_matrix = calculate_feature_matrix(
        [property_feature],
        entityset=es,
        cutoff_time=cutoff_time,
        verbose=True,
        chunk_size=0.13,
        dask_kwargs=dkwargs,
        approximate="1 hour",
    )
    assert (feature_matrix[property_feature.get_name()] == labels).values.all()
    feature_matrix = calculate_feature_matrix(
        [property_feature],
        entityset=es,
        cutoff_time=cutoff_time,
        verbose=True,
        chunk_size=0.13,
        dask_kwargs=dkwargs,
        approximate="1 hour",
    )
    captured = capsys.readouterr()
    assert "Using EntitySet persisted on the cluster as dataset " in captured[0]
    assert (feature_matrix[property_feature.get_name()] == labels).values.all()


class TestCreateClientAndCluster:
    def test_user_cluster_as_string(self, monkeypatch):
        monkeypatch.setattr(utils, "get_client_cluster", get_mock_client_cluster)
        # cluster in dask_kwargs case
        client, cluster = create_client_and_cluster(
            n_jobs=2,
            dask_kwargs={"cluster": "tcp://127.0.0.1:54321"},
            entityset_size=1,
        )
        assert cluster == "tcp://127.0.0.1:54321"

    def test_cluster_creation(self, monkeypatch):
        total_memory = psutil.virtual_memory().total
        monkeypatch.setattr(utils, "get_client_cluster", get_mock_client_cluster)
        try:
            cpus = len(psutil.Process().cpu_affinity())
        except AttributeError:  # pragma: no cover
            cpus = psutil.cpu_count()

        # jobs < tasks case
        client, cluster = create_client_and_cluster(
            n_jobs=2,
            dask_kwargs={},
            entityset_size=1,
        )
        num_workers = min(cpus, 2)
        memory_limit = int(total_memory / float(num_workers))
        assert cluster == (min(cpus, 2), 1, None, memory_limit)
        # jobs > tasks case
        match = r".*workers requested, but only .* workers created"
        with pytest.warns(UserWarning, match=match) as record:
            client, cluster = create_client_and_cluster(
                n_jobs=1000,
                dask_kwargs={"diagnostics_port": 8789},
                entityset_size=1,
            )
        assert len(record) == 1

        num_workers = cpus
        memory_limit = int(total_memory / float(num_workers))
        assert cluster == (num_workers, 1, 8789, memory_limit)

        # dask_kwargs sets memory limit
        client, cluster = create_client_and_cluster(
            n_jobs=2,
            dask_kwargs={"diagnostics_port": 8789, "memory_limit": 1000},
            entityset_size=1,
        )
        num_workers = min(cpus, 2)
        assert cluster == (num_workers, 1, 8789, 1000)

    def test_not_enough_memory(self, monkeypatch):
        total_memory = psutil.virtual_memory().total
        monkeypatch.setattr(utils, "get_client_cluster", get_mock_client_cluster)
        # errors if not enough memory for each worker to store the entityset
        with pytest.raises(ValueError, match=""):
            create_client_and_cluster(
                n_jobs=1,
                dask_kwargs={},
                entityset_size=total_memory * 2,
            )

        # does not error even if worker memory is less than 2x entityset size
        create_client_and_cluster(
            n_jobs=1,
            dask_kwargs={},
            entityset_size=total_memory * 0.75,
        )


def test_parallel_failure_raises_correct_error(es):
    times = (
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [datetime(2011, 4, 9, 10, 40, 0)]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)]
    )
    cutoff_time = pd.DataFrame({"time": times, "instance_id": range(17)})
    property_feature = IdentityFeature(es["log"].ww["value"]) > 10

    error_text = "Need at least one worker"
    with pytest.raises(AssertionError, match=error_text):
        calculate_feature_matrix(
            [property_feature],
            entityset=es,
            cutoff_time=cutoff_time,
            verbose=True,
            chunk_size=0.13,
            n_jobs=0,
            approximate="1 hour",
        )


def test_warning_not_enough_chunks(
    es,
    capsys,
    three_worker_dask_cluster,
):  # pragma: no cover
    property_feature = IdentityFeature(es["log"].ww["value"]) > 10

    dkwargs = {"cluster": three_worker_dask_cluster.scheduler.address}
    calculate_feature_matrix(
        [property_feature],
        entityset=es,
        chunk_size=0.5,
        verbose=True,
        dask_kwargs=dkwargs,
    )

    captured = capsys.readouterr()
    pattern = r"Fewer chunks \([0-9]+\), than workers \([0-9]+\) consider reducing the chunk size"
    assert re.search(pattern, captured.out) is not None


def test_n_jobs():
    try:
        cpus = len(psutil.Process().cpu_affinity())
    except AttributeError:  # pragma: no cover
        cpus = psutil.cpu_count()

    assert n_jobs_to_workers(1) == 1
    assert n_jobs_to_workers(-1) == cpus
    assert n_jobs_to_workers(cpus) == cpus
    assert n_jobs_to_workers((cpus + 1) * -1) == 1
    if cpus > 1:
        assert n_jobs_to_workers(-2) == cpus - 1

    error_text = "Need at least one worker"
    with pytest.raises(AssertionError, match=error_text):
        n_jobs_to_workers(0)


def test_parallel_cutoff_time_column_pass_through(es, dask_cluster):
    times = (
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [datetime(2011, 4, 9, 10, 40, 0)]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)]
    )
    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2
    cutoff_time = pd.DataFrame(
        {"time": times, "instance_id": range(17), "labels": labels},
    )
    property_feature = IdentityFeature(es["log"].ww["value"]) > 10

    dkwargs = {"cluster": dask_cluster.scheduler.address}
    feature_matrix = calculate_feature_matrix(
        [property_feature],
        entityset=es,
        cutoff_time=cutoff_time,
        verbose=True,
        dask_kwargs=dkwargs,
        approximate="1 hour",
    )

    assert (
        feature_matrix[property_feature.get_name()] == feature_matrix["labels"]
    ).values.all()


def test_integer_time_index(int_es):
    times = list(range(8, 18)) + list(range(19, 26))
    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2
    cutoff_df = pd.DataFrame({"time": times, "instance_id": range(17)})
    property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10

    feature_matrix = calculate_feature_matrix(
        [property_feature],
        int_es,
        cutoff_time=cutoff_df,
        cutoff_time_in_index=True,
    )

    time_level_vals = feature_matrix.index.get_level_values(1).values
    sorted_df = cutoff_df.sort_values(["time", "instance_id"], kind="mergesort")
    assert (time_level_vals == sorted_df["time"].values).all()
    assert (feature_matrix[property_feature.get_name()] == labels).values.all()


def test_integer_time_index_single_cutoff_value(int_es):
    labels = [False] * 3 + [True] * 2 + [False] * 4
    property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10

    cutoff_times = [16, pd.Series([16])[0], 16.0, pd.Series([16.0])[0]]
    for cutoff_time in cutoff_times:
        feature_matrix = calculate_feature_matrix(
            [property_feature],
            int_es,
            cutoff_time=cutoff_time,
            cutoff_time_in_index=True,
        )
        time_level_vals = feature_matrix.index.get_level_values(1).values
        assert (time_level_vals == [16] * 9).all()
        assert (feature_matrix[property_feature.get_name()] == labels).values.all()


def test_integer_time_index_datetime_cutoffs(int_es):
    times = [datetime.now()] * 17
    cutoff_df = pd.DataFrame({"time": times, "instance_id": range(17)})
    property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10

    error_text = (
        "cutoff_time times must be numeric: try casting via pd\\.to_numeric\\(\\)"
    )
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix(
            [property_feature],
            int_es,
            cutoff_time=cutoff_df,
            cutoff_time_in_index=True,
        )


def test_integer_time_index_passes_extra_columns(int_es):
    times = list(range(8, 18)) + list(range(19, 23)) + [25, 24, 23]
    labels = [False] * 3 + [True] * 2 + [False] * 9 + [False] * 2 + [True]
    instances = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 15, 14]
    cutoff_df = pd.DataFrame(
        {"time": times, "instance_id": instances, "labels": labels},
    )
    cutoff_df = cutoff_df[["time", "instance_id", "labels"]]
    property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10

    fm = calculate_feature_matrix(
        [property_feature],
        int_es,
        cutoff_time=cutoff_df,
        cutoff_time_in_index=True,
    )
    assert (fm[property_feature.get_name()] == fm["labels"]).all()


def test_integer_time_index_mixed_cutoff(int_es):
    times_dt = list(range(8, 17)) + [datetime(2011, 1, 1), 19, 20, 21, 22, 25, 24, 23]
    labels = [False] * 3 + [True] * 2 + [False] * 9 + [False] * 2 + [True]
    instances = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 15, 14]
    cutoff_df = pd.DataFrame(
        {"time": times_dt, "instance_id": instances, "labels": labels},
    )
    cutoff_df = cutoff_df[["time", "instance_id", "labels"]]
    property_feature = IdentityFeature(int_es["log"].ww["value"]) > 10

    error_text = "cutoff_time times must be.*try casting via.*"
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df)

    times_str = list(range(8, 17)) + ["foobar", 19, 20, 21, 22, 25, 24, 23]
    cutoff_df["time"] = times_str
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df)

    times_date_str = list(range(8, 17)) + ["2018-04-02", 19, 20, 21, 22, 25, 24, 23]
    cutoff_df["time"] = times_date_str
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df)

    times_int_str = list(range(8, 17)) + ["17", 19, 20, 21, 22, 25, 24, 23]
    cutoff_df["time"] = times_int_str
    # a numeric string mixed in with ints is not cast automatically, so this
    # case is also expected to raise a TypeError
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df)


def test_datetime_index_mixed_cutoff(es):
    times = list(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [17]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],
    )
    labels = [False] * 3 + [True] * 2 + [False] * 9 + [False] * 2 + [True]
    instances = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 15, 14]
    cutoff_df = pd.DataFrame(
        {"time": times, "instance_id": instances, "labels": labels},
    )
    cutoff_df = cutoff_df[["time", "instance_id", "labels"]]
    property_feature = IdentityFeature(es["log"].ww["value"]) > 10

    error_text = "cutoff_time times must be.*try casting via.*"
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_df)

    times[9] = "foobar"
    cutoff_df["time"] = times
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_df)

    times[9] = "17"
    cutoff_df["time"] = times
    with pytest.raises(TypeError, match=error_text):
        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_df)


def test_no_data_for_cutoff_time(mock_customer):
    es = mock_customer
    cutoff_times = pd.DataFrame(
        {"customer_id": [4], "time": pd.Timestamp("2011-04-08 20:08:13")},
    )

    trans_per_session = Feature(
        es["transactions"].ww["transaction_id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    trans_per_customer = Feature(
        es["transactions"].ww["transaction_id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    max_count = Feature(
        trans_per_session,
        parent_dataframe_name="customers",
        primitive=Max,
    )
    features = [trans_per_customer, max_count]

    fm = calculate_feature_matrix(features, entityset=es, cutoff_time=cutoff_times)

    # due to the default values for each primitive,
    # count will be 0, but max will be NaN
    answer = pd.DataFrame(
        {
            trans_per_customer.get_name(): pd.Series([0], dtype="Int64"),
            max_count.get_name(): pd.Series([np.nan], dtype="float"),
        },
    )
    for column in fm.columns:
        pd.testing.assert_series_equal(
            fm[column],
            answer[column],
            check_index=False,
            check_names=False,
        )


def test_instances_not_in_data(es):
    last_instance = max(es["log"].index.values)
    instances = list(range(last_instance + 1, last_instance + 11))
    identity_feature = IdentityFeature(es["log"].ww["value"])
    property_feature = identity_feature > 10
    agg_feat = AggregationFeature(
        Feature(es["log"].ww["value"]),
        parent_dataframe_name="sessions",
        primitive=Max,
    )
    direct_feature = DirectFeature(agg_feat, "log")
    features = [identity_feature, property_feature, direct_feature]
    fm = calculate_feature_matrix(features, entityset=es, instance_ids=instances)
    assert all(fm.index.values == instances)
    for column in fm.columns:
        assert fm[column].isnull().all()

    fm = calculate_feature_matrix(
        features,
        entityset=es,
        instance_ids=instances,
        approximate="730 days",
    )
    assert all(fm.index.values == instances)
    for column in fm.columns:
        assert fm[column].isnull().all()


def test_some_instances_not_in_data(es):
    a_time = datetime(2011, 4, 10, 10, 41, 9)  # only valid data
    b_time = datetime(2011, 4, 10, 11, 10, 5)  # some missing data
    c_time = datetime(2011, 4, 10, 12, 0, 0)  # all missing data

    times = [a_time, b_time, a_time, a_time, b_time, b_time] + [c_time] * 4
    cutoff_time = pd.DataFrame({"instance_id": list(range(12, 22)), "time": times})
    identity_feature = IdentityFeature(es["log"].ww["value"])
    property_feature = identity_feature > 10
    agg_feat = AggregationFeature(
        Feature(es["log"].ww["value"]),
        parent_dataframe_name="sessions",
        primitive=Max,
    )
    direct_feature = DirectFeature(agg_feat, "log")
    features = [identity_feature, property_feature, direct_feature]
    fm = calculate_feature_matrix(features, entityset=es, cutoff_time=cutoff_time)
    ifeat_answer = pd.Series([0, 7, 14, np.nan] + [np.nan] * 6)
    prop_answer = pd.Series([0, 0, 1, pd.NA, 0] + [pd.NA] * 5, dtype="boolean")
    dfeat_answer = pd.Series([14, 14, 14, np.nan] + [np.nan] * 6)

    assert all(fm.index.values == cutoff_time["instance_id"].values)
    for x, y in zip(fm.columns, [ifeat_answer, prop_answer, dfeat_answer]):
        pd.testing.assert_series_equal(fm[x], y, check_index=False, check_names=False)

    fm = calculate_feature_matrix(
        features,
        entityset=es,
        cutoff_time=cutoff_time,
        approximate="5 seconds",
    )

    dfeat_answer[0] = 7  # approximate calculated before 14 appears
    dfeat_answer[2] = 7  # approximate calculated before 14 appears
    prop_answer[3] = False  # no_unapproximated_aggs code ignores cutoff time

    assert all(fm.index.values == cutoff_time["instance_id"].values)
    for x, y in zip(fm.columns, [ifeat_answer, prop_answer, dfeat_answer]):
        pd.testing.assert_series_equal(fm[x], y, check_index=False, check_names=False)


def test_missing_instances_with_categorical_index(es):
    instance_ids = ["coke zero", "car", 3, "taco clock"]
    features = dfs(
        entityset=es,
        target_dataframe_name="products",
        features_only=True,
    )

    fm = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=instance_ids,
    )
    assert fm.index.values.to_list() == instance_ids
    assert isinstance(fm.index, pd.CategoricalIndex)


def test_handle_chunk_size():
    total_size = 100

    # user provides no chunk size
    assert _handle_chunk_size(None, total_size) is None

    # user provides fractional size
    assert _handle_chunk_size(0.1, total_size) == total_size * 0.1
    assert _handle_chunk_size(0.001, total_size) == 1  # rounds up
    assert _handle_chunk_size(0.345, total_size) == 35  # rounds up

    # user provides absolute size
    assert _handle_chunk_size(1, total_size) == 1
    assert _handle_chunk_size(100, total_size) == 100
    assert isinstance(_handle_chunk_size(100.0, total_size), int)

    # test invalid cases
    with pytest.raises(AssertionError, match="Chunk size must be greater than 0"):
        _handle_chunk_size(0, total_size)

    with pytest.raises(AssertionError, match="Chunk size must be greater than 0"):
        _handle_chunk_size(-1, total_size)


def test_chunk_dataframe_groups():
    df = pd.DataFrame({"group": [1, 1, 1, 1, 2, 2, 3]})

    grouped = df.groupby("group")
    chunked_grouped = _chunk_dataframe_groups(grouped, 2)

    # test group larger than chunk size gets split up
    first = next(chunked_grouped)
    assert first[0] == 1 and first[1].shape[0] == 2
    second = next(chunked_grouped)
    assert second[0] == 1 and second[1].shape[0] == 2

    # test that groups equal to or smaller than the chunk size stay together
    third = next(chunked_grouped)
    assert third[0] == 2 and third[1].shape[0] == 2
    fourth = next(chunked_grouped)
    assert fourth[0] == 3 and fourth[1].shape[0] == 1


def test_calls_progress_callback(mock_customer):
    class MockProgressCallback:
        def __init__(self):
            self.progress_history = []
            self.total_update = 0
            self.total_progress_percent = 0

        def __call__(self, update, progress_percent, time_elapsed):
            self.total_update += update
            self.total_progress_percent = progress_percent
            self.progress_history.append(progress_percent)

    mock_progress_callback = MockProgressCallback()

    es = mock_customer

    # make sure to calculate features that have different paths to same base feature
    trans_per_session = Feature(
        es["transactions"].ww["transaction_id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    trans_per_customer = Feature(
        es["transactions"].ww["transaction_id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    features = [trans_per_session, Feature(trans_per_customer, "sessions")]
    calculate_feature_matrix(
        features,
        entityset=es,
        progress_callback=mock_progress_callback,
    )

    # second to last entry is the last update from feature calculation
    assert np.isclose(
        mock_progress_callback.progress_history[-2],
        FEATURE_CALCULATION_PERCENTAGE * 100,
    )
    assert np.isclose(mock_progress_callback.total_update, 100.0)
    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)

    # test with cutoff time dataframe
    mock_progress_callback = MockProgressCallback()
    cutoff_time = pd.DataFrame(
        {
            "instance_id": [1, 2, 3],
            "time": [
                pd.to_datetime("2014-01-01 01:00:00"),
                pd.to_datetime("2014-01-01 02:00:00"),
                pd.to_datetime("2014-01-01 03:00:00"),
            ],
        },
    )

    calculate_feature_matrix(
        features,
        entityset=es,
        cutoff_time=cutoff_time,
        progress_callback=mock_progress_callback,
    )
    assert np.isclose(
        mock_progress_callback.progress_history[-2],
        FEATURE_CALCULATION_PERCENTAGE * 100,
    )
    assert np.isclose(mock_progress_callback.total_update, 100.0)
    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)


def test_calls_progress_callback_cluster(mock_customer, dask_cluster):
    class MockProgressCallback:
        def __init__(self):
            self.progress_history = []
            self.total_update = 0
            self.total_progress_percent = 0

        def __call__(self, update, progress_percent, time_elapsed):
            self.total_update += update
            self.total_progress_percent = progress_percent
            self.progress_history.append(progress_percent)

    mock_progress_callback = MockProgressCallback()

    trans_per_session = Feature(
        mock_customer["transactions"].ww["transaction_id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    trans_per_customer = Feature(
        mock_customer["transactions"].ww["transaction_id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    features = [trans_per_session, Feature(trans_per_customer, "sessions")]

    dkwargs = {"cluster": dask_cluster.scheduler.address}
    calculate_feature_matrix(
        features,
        entityset=mock_customer,
        progress_callback=mock_progress_callback,
        dask_kwargs=dkwargs,
    )

    assert np.isclose(mock_progress_callback.total_update, 100.0)
    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)


def test_closes_tqdm(es):
    class ErrorPrim(TransformPrimitive):
        """A primitive whose function raises an error"""

        name = "error_prim"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = "Numeric"

        def get_function(self):
            def error(s):
                raise RuntimeError("This primitive has errored")

            return error

    value = Feature(es["log"].ww["value"])
    property_feature = value > 10
    error_feature = Feature(value, primitive=ErrorPrim)

    calculate_feature_matrix([property_feature], es, verbose=True)

    assert len(tqdm._instances) == 0

    match = "This primitive has errored"
    with pytest.raises(RuntimeError, match=match):
        calculate_feature_matrix([value, error_feature], es, verbose=True)
    assert len(tqdm._instances) == 0


def test_approximate_with_single_cutoff_warns(es):
    features = dfs(
        entityset=es,
        target_dataframe_name="customers",
        features_only=True,
        ignore_dataframes=["cohorts"],
        agg_primitives=["sum"],
    )

    match = (
        "Using approximate with a single cutoff_time value or no cutoff_time "
        "provides no computational efficiency benefit"
    )
    # test warning with single cutoff time
    with pytest.warns(UserWarning, match=match):
        calculate_feature_matrix(
            features,
            es,
            cutoff_time=pd.to_datetime("2020-01-01"),
            approximate="1 day",
        )
    # test warning with no cutoff time
    with pytest.warns(UserWarning, match=match):
        calculate_feature_matrix(features, es, approximate="1 day")

    # check proper handling of approximate
    feature_matrix = calculate_feature_matrix(
        features,
        es,
        cutoff_time=pd.to_datetime("2011-04-09 10:31:30"),
        approximate="1 minute",
    )

    expected_values = [50, 50, 50]
    assert (feature_matrix["régions.SUM(log.value)"] == expected_values).values.all()


def test_calc_feature_matrix_with_cutoff_df_and_instance_ids(es):
    times = list(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [datetime(2011, 4, 9, 10, 40, 0)]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],
    )
    instances = range(17)
    cutoff_time = pd.DataFrame({"time": times, es["log"].ww.index: instances})
    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2

    property_feature = Feature(es["log"].ww["value"]) > 10

    match = "Passing 'instance_ids' is valid only if 'cutoff_time' is a single value or None - ignoring"
    with pytest.warns(UserWarning, match=match):
        feature_matrix = calculate_feature_matrix(
            [property_feature],
            es,
            cutoff_time=cutoff_time,
            instance_ids=[1, 3, 5],
            verbose=True,
        )

    assert (feature_matrix[property_feature.get_name()] == labels).values.all()


def test_calculate_feature_matrix_returns_default_values(default_value_es):
    sum_features = Feature(
        default_value_es["transactions"].ww["value"],
        parent_dataframe_name="sessions",
        primitive=Sum,
    )
    sessions_sum = Feature(sum_features, "transactions")

    feature_matrix = calculate_feature_matrix(
        features=[sessions_sum],
        entityset=default_value_es,
    )

    expected_values = [2.0, 2.0, 1.0, 0.0]

    assert (feature_matrix[sessions_sum.get_name()] == expected_values).values.all()


def test_dataframes_relationships(dataframes, relationships):
    fm_1, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
    )

    fm_2 = calculate_feature_matrix(
        features=features,
        dataframes=dataframes,
        relationships=relationships,
    )

    assert fm_1.equals(fm_2)


def test_no_dataframes(dataframes, relationships):
    features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        features_only=True,
    )

    msg = "No dataframes or valid EntitySet provided"
    with pytest.raises(TypeError, match=msg):
        calculate_feature_matrix(features=features, dataframes=None, relationships=None)


def test_no_relationships(dataframes):
    fm_1, features = dfs(
        dataframes=dataframes,
        relationships=None,
        target_dataframe_name="transactions",
    )

    fm_2 = calculate_feature_matrix(
        features=features,
        dataframes=dataframes,
        relationships=None,
    )

    assert fm_1.equals(fm_2)


def test_cfm_with_invalid_time_index(es):
    features = dfs(entityset=es, target_dataframe_name="customers", features_only=True)
    es["customers"].ww.set_types(logical_types={"signup_date": "integer"})
    match = "customers time index is numeric type "
    match += "which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=match):
        calculate_feature_matrix(features=features, entityset=es)


def test_cfm_introduces_nan_values_in_direct_feats(es):
    es["customers"].ww.set_types(
        logical_types={"age": "Age", "engagement_level": "Integer"},
    )
    age_feat = Feature(es["customers"].ww["age"])
    engagement_feat = Feature(es["customers"].ww["engagement_level"])
    loves_ice_cream_feat = Feature(es["customers"].ww["loves_ice_cream"])
    features = [age_feat, engagement_feat, loves_ice_cream_feat]
    fm = calculate_feature_matrix(
        features=features,
        entityset=es,
        cutoff_time=pd.Timestamp("2010-04-08 04:00"),
        instance_ids=[1],
    )

    assert isinstance(es["customers"].ww.logical_types["age"], Age)
    assert isinstance(es["customers"].ww.logical_types["engagement_level"], Integer)
    assert isinstance(es["customers"].ww.logical_types["loves_ice_cream"], Boolean)

    assert isinstance(fm.ww.logical_types["age"], AgeNullable)
    assert isinstance(fm.ww.logical_types["engagement_level"], IntegerNullable)
    assert isinstance(fm.ww.logical_types["loves_ice_cream"], BooleanNullable)


def test_feature_origins_present_on_all_fm_cols(es):
    class MultiCumSum(TransformPrimitive):
        name = "multi_cum_sum"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 3

        def get_function(self):
            def multi_cum_sum(x):
                return x.cumsum(), x.cummax(), x.cummin()

            return multi_cum_sum

    feature_matrix, _ = dfs(
        entityset=es,
        target_dataframe_name="log",
        trans_primitives=[MultiCumSum],
    )

    for col in feature_matrix.columns:
        origin = feature_matrix.ww[col].ww.origin
        assert origin in ["base", "engineered"]


def test_renamed_features_have_expected_column_names_in_feature_matrix(es):
    class MultiCumulative(TransformPrimitive):
        name = "multi_cum_sum"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 3

        def get_function(self):
            def multi_cum_sum(x):
                return x.cumsum(), x.cummax(), x.cummin()

            return multi_cum_sum

    multi_output_trans_feat = Feature(
        es["log"].ww["value"],
        primitive=MultiCumulative,
    )
    groupby_trans_feat = GroupByTransformFeature(
        es["log"].ww["value"],
        primitive=MultiCumulative,
        groupby=es["log"].ww["product_id"],
    )
    multi_output_agg_feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    output_slice = FeatureOutputSlice(multi_output_trans_feat, 1)
    stacked_feat = Feature(output_slice, primitive=Negate)

    multi_output_trans_names = ["cumulative_sum", "cumulative_max", "cumulative_min"]
    multi_output_trans_feat.set_feature_names(multi_output_trans_names)
    groupby_trans_feat_names = ["grouped_sum", "grouped_max", "grouped_min"]
    groupby_trans_feat.set_feature_names(groupby_trans_feat_names)
    agg_names = ["first_most_common", "second_most_common"]
    multi_output_agg_feat.set_feature_names(agg_names)

    features = [
        multi_output_trans_feat,
        multi_output_agg_feat,
        stacked_feat,
        groupby_trans_feat,
    ]
    feature_matrix = calculate_feature_matrix(entityset=es, features=features)
    expected_names = multi_output_trans_names + agg_names + groupby_trans_feat_names
    for renamed_col in expected_names:
        assert renamed_col in feature_matrix.columns

    expected_stacked_name = "-(cumulative_max)"
    assert expected_stacked_name in feature_matrix.columns


================================================
FILE: featuretools/tests/computational_backend/test_feature_set.py
================================================
from featuretools import (
    AggregationFeature,
    DirectFeature,
    IdentityFeature,
    TransformFeature,
    primitives,
)
from featuretools.computational_backends.feature_set import FeatureSet
from featuretools.entityset.relationship import RelationshipPath
from featuretools.tests.testing_utils import backward_path
from featuretools.utils import Trie


def test_feature_trie_without_needs_full_dataframe(diamond_es):
    es = diamond_es
    country_name = IdentityFeature(es["countries"].ww["name"])
    direct_name = DirectFeature(country_name, "regions")
    amount = IdentityFeature(es["transactions"].ww["amount"])

    path_through_customers = backward_path(es, ["regions", "customers", "transactions"])
    through_customers = AggregationFeature(
        amount,
        "regions",
        primitive=primitives.Mean,
        relationship_path=path_through_customers,
    )
    path_through_stores = backward_path(es, ["regions", "stores", "transactions"])
    through_stores = AggregationFeature(
        amount,
        "regions",
        primitive=primitives.Mean,
        relationship_path=path_through_stores,
    )
    customers_to_transactions = backward_path(es, ["customers", "transactions"])
    customers_mean = AggregationFeature(
        amount,
        "customers",
        primitive=primitives.Mean,
        relationship_path=customers_to_transactions,
    )

    negation = TransformFeature(customers_mean, primitives.Negate)
    regions_to_customers = backward_path(es, ["regions", "customers"])
    mean_of_mean = AggregationFeature(
        negation,
        "regions",
        primitive=primitives.Mean,
        relationship_path=regions_to_customers,
    )

    features = [direct_name, through_customers, through_stores, mean_of_mean]

    feature_set = FeatureSet(features)
    trie = feature_set.feature_trie

    assert trie.value == (False, set(), {f.unique_name() for f in features})
    assert trie.get_node(direct_name.relationship_path).value == (
        False,
        set(),
        {country_name.unique_name()},
    )
    assert trie.get_node(regions_to_customers).value == (
        False,
        set(),
        {negation.unique_name(), customers_mean.unique_name()},
    )
    regions_to_stores = backward_path(es, ["regions", "stores"])
    assert trie.get_node(regions_to_stores).value == (False, set(), set())
    assert trie.get_node(path_through_customers).value == (
        False,
        set(),
        {amount.unique_name()},
    )
    assert trie.get_node(path_through_stores).value == (
        False,
        set(),
        {amount.unique_name()},
    )


def test_feature_trie_with_needs_full_dataframe(diamond_es):
    es = diamond_es
    amount = IdentityFeature(es["transactions"].ww["amount"])

    path_through_customers = backward_path(
        es,
        ["regions", "customers", "transactions"],
    )
    agg = AggregationFeature(
        amount,
        "regions",
        primitive=primitives.Mean,
        relationship_path=path_through_customers,
    )
    trans_of_agg = TransformFeature(agg, primitives.CumSum)

    path_through_stores = backward_path(es, ["regions", "stores", "transactions"])
    trans = TransformFeature(amount, primitives.CumSum)
    agg_of_trans = AggregationFeature(
        trans,
        "regions",
        primitive=primitives.Mean,
        relationship_path=path_through_stores,
    )

    features = [agg, trans_of_agg, agg_of_trans]
    feature_set = FeatureSet(features)
    trie = feature_set.feature_trie

    assert trie.value == (
        True,
        {agg.unique_name(), trans_of_agg.unique_name()},
        {agg_of_trans.unique_name()},
    )
    assert trie.get_node(path_through_customers).value == (
        True,
        {amount.unique_name()},
        set(),
    )
    assert trie.get_node(path_through_customers[:1]).value == (True, set(), set())
    assert trie.get_node(path_through_stores).value == (
        True,
        {amount.unique_name(), trans.unique_name()},
        set(),
    )
    assert trie.get_node(path_through_stores[:1]).value == (False, set(), set())


def test_feature_trie_with_needs_full_dataframe_direct(es):
    value = IdentityFeature(es["log"].ww["value"])
    agg = AggregationFeature(value, "sessions", primitive=primitives.Mean)
    agg_of_agg = AggregationFeature(agg, "customers", primitive=primitives.Sum)
    direct = DirectFeature(agg_of_agg, "sessions")
    trans = TransformFeature(direct, primitives.CumSum)

    features = [trans, agg]
    feature_set = FeatureSet(features)
    trie = feature_set.feature_trie

    assert trie.value == (
        True,
        {direct.unique_name(), trans.unique_name()},
        {agg.unique_name()},
    )

    assert trie.get_node(agg.relationship_path).value == (
        False,
        set(),
        {value.unique_name()},
    )

    parent_node = trie.get_node(direct.relationship_path)
    assert parent_node.value == (True, {agg_of_agg.unique_name()}, set())

    child_through_parent_node = parent_node.get_node(agg_of_agg.relationship_path)
    assert child_through_parent_node.value == (True, {agg.unique_name()}, set())

    assert child_through_parent_node.get_node(agg.relationship_path).value == (
        True,
        {value.unique_name()},
        set(),
    )


def test_feature_trie_ignores_approximate_features(es):
    value = IdentityFeature(es["log"].ww["value"])
    agg = AggregationFeature(value, "sessions", primitive=primitives.Mean)
    agg_of_agg = AggregationFeature(agg, "customers", primitive=primitives.Sum)
    direct = DirectFeature(agg_of_agg, "sessions")
    features = [direct, agg]

    approximate_feature_trie = Trie(default=list, path_constructor=RelationshipPath)
    approximate_feature_trie.get_node(direct.relationship_path).value = [agg_of_agg]
    feature_set = FeatureSet(
        features,
        approximate_feature_trie=approximate_feature_trie,
    )
    trie = feature_set.feature_trie

    # Since agg_of_agg is ignored, neither it nor its dependencies should be
    # in the trie.
    sub_trie = trie.get_node(direct.relationship_path)
    for _path, (_, _, features) in sub_trie:
        assert not features

    assert trie.value == (False, set(), {direct.unique_name(), agg.unique_name()})
    assert trie.get_node(agg.relationship_path).value == (
        False,
        set(),
        {value.unique_name()},
    )
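

# Illustrative sketch only, not the featuretools implementation: the assertions
# above walk a trie whose nodes hold a default value until something is stored
# in them. `_SketchTrie` is a hypothetical stand-in that assumes a path is any
# iterable of hashable edges and that nodes are created on demand; the real
# `Trie` additionally takes a `path_constructor`.
class _SketchTrie:
    def __init__(self, default):
        self._default = default
        self.value = default()
        self._children = {}

    def get_node(self, path):
        """Return the node reached by following `path`, creating nodes as needed."""
        node = self
        for edge in path:
            if edge not in node._children:
                node._children[edge] = _SketchTrie(self._default)
            node = node._children[edge]
        return node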


================================================
FILE: featuretools/tests/computational_backend/test_feature_set_calculator.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
import pytest
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Categorical, Datetime, Double, Integer

from featuretools import (
    AggregationFeature,
    EntitySet,
    Feature,
    Timedelta,
    calculate_feature_matrix,
)
from featuretools.computational_backends.feature_set import FeatureSet
from featuretools.computational_backends.feature_set_calculator import (
    FeatureSetCalculator,
)
from featuretools.entityset.relationship import RelationshipPath
from featuretools.feature_base import DirectFeature, IdentityFeature
from featuretools.primitives import (
    And,
    Count,
    CumSum,
    EqualScalar,
    GreaterThanEqualToScalar,
    GreaterThanScalar,
    LessThanEqualToScalar,
    LessThanScalar,
    Mean,
    Min,
    Mode,
    Negate,
    NMostCommon,
    NotEqualScalar,
    NumTrue,
    Sum,
    TimeSinceLast,
    Trend,
)
from featuretools.primitives.base import AggregationPrimitive
from featuretools.primitives.standard.aggregation.num_unique import NumUnique
from featuretools.tests.testing_utils import backward_path
from featuretools.utils import Trie


def test_make_identity(es):
    f = IdentityFeature(es["log"].ww["datetime"])

    feature_set = FeatureSet([f])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    v = df[f.get_name()][0]
    assert v == datetime(2011, 4, 9, 10, 30, 0)


def test_make_dfeat(es):
    f = DirectFeature(
        Feature(es["customers"].ww["age"]),
        child_dataframe_name="sessions",
    )

    feature_set = FeatureSet([f])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    v = df[f.get_name()][0]
    assert v == 33


def test_make_agg_feat_of_identity_column(es):
    agg_feat = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        primitive=Sum,
    )

    feature_set = FeatureSet([agg_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    v = df[agg_feat.get_name()][0]
    assert v == 50


def test_full_dataframe_trans_of_agg(es):
    agg_feat = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Sum,
    )
    trans_feat = Feature(agg_feat, primitive=CumSum)

    feature_set = FeatureSet([trans_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([1]))

    v = df[trans_feat.get_name()].values[0]
    assert v == 82


def test_make_agg_feat_of_identity_index_column(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    feature_set = FeatureSet([agg_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    v = df[agg_feat.get_name()][0]
    assert v == 5


def test_make_agg_feat_where_count(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        where=IdentityFeature(es["log"].ww["product_id"]) == "coke zero",
        primitive=Count,
    )

    feature_set = FeatureSet([agg_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    v = df[agg_feat.get_name()][0]
    assert v == 3


def test_make_agg_feat_using_prev_time(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        use_previous=Timedelta(10, "s"),
        primitive=Count,
    )

    feature_set = FeatureSet([agg_feat])
    calculator = FeatureSetCalculator(
        es,
        time_last=datetime(2011, 4, 9, 10, 30, 10),
        feature_set=feature_set,
    )
    df = calculator.run(np.array([0]))

    v = df[agg_feat.get_name()][0]
    assert v == 2

    calculator = FeatureSetCalculator(
        es,
        time_last=datetime(2011, 4, 9, 10, 30, 30),
        feature_set=feature_set,
    )
    df = calculator.run(np.array([0]))

    v = df[agg_feat.get_name()][0]
    assert v == 1


def test_make_agg_feat_using_prev_n_events(es):
    agg_feat_1 = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        use_previous=Timedelta(1, "observations"),
        primitive=Min,
    )

    agg_feat_2 = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        use_previous=Timedelta(3, "observations"),
        primitive=Min,
    )

    assert (
        agg_feat_1.get_name() != agg_feat_2.get_name()
    ), "Features should have different names based on use_previous"

    feature_set = FeatureSet([agg_feat_1, agg_feat_2])
    calculator = FeatureSetCalculator(
        es,
        time_last=datetime(2011, 4, 9, 10, 30, 6),
        feature_set=feature_set,
    )
    df = calculator.run(np.array([0]))

    # time_last is included by default
    v1 = df[agg_feat_1.get_name()][0]
    v2 = df[agg_feat_2.get_name()][0]
    assert v1 == 5
    assert v2 == 0

    calculator = FeatureSetCalculator(
        es,
        time_last=datetime(2011, 4, 9, 10, 30, 30),
        feature_set=feature_set,
    )
    df = calculator.run(np.array([0]))

    v1 = df[agg_feat_1.get_name()][0]
    v2 = df[agg_feat_2.get_name()][0]
    assert v1 == 20
    assert v2 == 10


def test_make_agg_feat_multiple_dtypes(es):
    compare_prod = IdentityFeature(es["log"].ww["product_id"]) == "coke zero"

    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        where=compare_prod,
        primitive=Count,
    )

    agg_feat2 = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="sessions",
        where=compare_prod,
        primitive=Mode,
    )

    feature_set = FeatureSet([agg_feat, agg_feat2])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    v = df[agg_feat.get_name()][0]
    v2 = df[agg_feat2.get_name()][0]
    assert v == 3
    assert v2 == "coke zero"


def test_make_agg_feat_where_different_identity_feat(es):
    feats = []
    where_cmps = [
        LessThanScalar,
        GreaterThanScalar,
        LessThanEqualToScalar,
        GreaterThanEqualToScalar,
        EqualScalar,
        NotEqualScalar,
    ]
    for where_cmp in where_cmps:
        feats.append(
            Feature(
                es["log"].ww["id"],
                parent_dataframe_name="sessions",
                where=Feature(
                    es["log"].ww["value"],
                    primitive=where_cmp(10.0),
                ),
                primitive=Count,
            ),
        )

    df = calculate_feature_matrix(
        entityset=es,
        features=feats,
        instance_ids=[0, 1, 2, 3],
    )

    for i, where_cmp in enumerate(where_cmps):
        name = feats[i].get_name()
        instances = df[name]
        v0, v1, v2, v3 = instances[0:4]
        if where_cmp == LessThanScalar:
            assert v0 == 2
            assert v1 == 4
            assert v2 == 1
            assert v3 == 2
        elif where_cmp == GreaterThanScalar:
            assert v0 == 2
            assert v1 == 0
            assert v2 == 0
            assert v3 == 0
        elif where_cmp == LessThanEqualToScalar:
            assert v0 == 3
            assert v1 == 4
            assert v2 == 1
            assert v3 == 2
        elif where_cmp == GreaterThanEqualToScalar:
            assert v0 == 3
            assert v1 == 0
            assert v2 == 0
            assert v3 == 0
        elif where_cmp == EqualScalar:
            assert v0 == 1
            assert v1 == 0
            assert v2 == 0
            assert v3 == 0
        elif where_cmp == NotEqualScalar:
            assert v0 == 4
            assert v1 == 4
            assert v2 == 1
            assert v3 == 2


def test_make_agg_feat_of_grandchild_dataframe(es):
    agg_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )

    feature_set = FeatureSet([agg_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[agg_feat.get_name()].values[0]
    assert v == 10


def test_make_agg_feat_where_count_feat(es):
    """
    Feature we're creating is:
    Number of sessions for each customer where the
    number of logs in the session is greater than 1
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    feat = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        where=log_count_feat > 1,
        primitive=Count,
    )

    feature_set = FeatureSet([feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1]))

    name = feat.get_name()
    instances = df[name]
    v0, v1 = instances[0:2]
    assert v0 == 2
    assert v1 == 2


def test_make_compare_feat(es):
    """
    Feature we're creating is:
    Whether the number of logs in each session is greater than the
    mean number of logs per session for that session's customer
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    mean_agg_feat = Feature(
        log_count_feat,
        parent_dataframe_name="customers",
        primitive=Mean,
    )

    mean_feat = DirectFeature(mean_agg_feat, child_dataframe_name="sessions")

    feat = log_count_feat > mean_feat

    feature_set = FeatureSet([feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1, 2]))

    name = feat.get_name()
    instances = df[name]
    v0, v1, v2 = instances[0:3]
    assert v0
    assert v1
    assert not v2


def test_make_agg_feat_where_count_and_device_type_feat(es):
    """
    Feature we're creating is:
    Number of sessions for each customer where the
    number of logs in the session equals 1 and the device type is 1
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    compare_count = log_count_feat == 1
    compare_device_type = IdentityFeature(es["sessions"].ww["device_type"]) == 1
    and_feat = Feature([compare_count, compare_device_type], primitive=And)
    feat = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        where=and_feat,
        primitive=Count,
    )

    feature_set = FeatureSet([feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    name = feat.get_name()
    instances = df[name]
    assert instances.values[0] == 1


def test_make_agg_feat_where_count_or_device_type_feat(es):
    """
    Feature we're creating is:
    Number of sessions for each customer where the
    number of logs in the session is greater than 1 or the device type is 1
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    compare_count = log_count_feat > 1
    compare_device_type = IdentityFeature(es["sessions"].ww["device_type"]) == 1
    or_feat = compare_count.OR(compare_device_type)
    feat = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        where=or_feat,
        primitive=Count,
    )

    feature_set = FeatureSet([feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    name = feat.get_name()
    instances = df[name]
    assert instances.values[0] == 3


def test_make_agg_feat_of_agg_feat(es):
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    customer_sum_feat = Feature(
        log_count_feat,
        parent_dataframe_name="customers",
        primitive=Sum,
    )

    feature_set = FeatureSet([customer_sum_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[customer_sum_feat.get_name()].values[0]
    assert v == 10


@pytest.fixture
def df():
    return pd.DataFrame(
        {
            "id": ["a", "b", "c", "d", "e"],
            "e1": ["h", "h", "i", "i", "j"],
            "e2": ["x", "x", "y", "y", "x"],
            "e3": ["z", "z", "z", "z", "z"],
            "val": [1, 1, 1, 1, 1],
        },
    )


def test_make_3_stacked_agg_feats(df):
    """
    Tests stacking 3 agg features.

    The test specifically uses non-numeric indices to test how ancestor columns are handled
    as dataframes are merged together.

    """
    es = EntitySet()
    ltypes = {"e1": Categorical, "e2": Categorical, "e3": Categorical, "val": Double}
    es.add_dataframe(
        dataframe=df,
        index="id",
        dataframe_name="e0",
        logical_types=ltypes,
    )

    es.normalize_dataframe(
        base_dataframe_name="e0",
        new_dataframe_name="e1",
        index="e1",
        additional_columns=["e2", "e3"],
    )

    es.normalize_dataframe(
        base_dataframe_name="e1",
        new_dataframe_name="e2",
        index="e2",
        additional_columns=["e3"],
    )

    es.normalize_dataframe(
        base_dataframe_name="e2",
        new_dataframe_name="e3",
        index="e3",
    )

    sum_1 = Feature(es["e0"].ww["val"], parent_dataframe_name="e1", primitive=Sum)
    sum_2 = Feature(sum_1, parent_dataframe_name="e2", primitive=Sum)
    sum_3 = Feature(sum_2, parent_dataframe_name="e3", primitive=Sum)

    feature_set = FeatureSet([sum_3])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array(["z"]))
    v = df[sum_3.get_name()][0]
    assert v == 5
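

# Illustrative sketch only (not part of the original suite): the three stacked
# sums above can be reproduced by hand with plain dictionaries. The helper name
# and the row layout `(e1, e2, e3, val)` are assumptions made for this example.
def _stacked_sums_by_hand(rows):
    sum_1, sum_2, sum_3 = {}, {}, {}
    e1_parent, e2_parent = {}, {}
    for e1, e2, e3, val in rows:
        sum_1[e1] = sum_1.get(e1, 0) + val  # SUM(e0.val) grouped by e1
        e1_parent[e1] = e2  # each e1 belongs to exactly one e2
        e2_parent[e2] = e3  # each e2 belongs to exactly one e3
    for e1, total in sum_1.items():
        e2 = e1_parent[e1]
        sum_2[e2] = sum_2.get(e2, 0) + total  # SUM(e1.SUM(e0.val)) grouped by e2
    for e2, total in sum_2.items():
        e3 = e2_parent[e2]
        sum_3[e3] = sum_3.get(e3, 0) + total  # outermost sum grouped by e3
    return sum_3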


def test_make_dfeat_of_agg_feat_on_self(es):
    """
    The graph looks like this:

        R       R = Regions, a parent of customers
        |
        C       C = Customers, the dataframe we're trying to predict on
        |
       etc.

    We're trying to calculate a DFeat from C to R on an agg_feat of R on C.
    """
    customer_count_feat = Feature(
        es["customers"].ww["id"],
        parent_dataframe_name="régions",
        primitive=Count,
    )

    num_customers_feat = DirectFeature(
        customer_count_feat,
        child_dataframe_name="customers",
    )

    feature_set = FeatureSet([num_customers_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[num_customers_feat.get_name()].values[0]
    assert v == 3


def test_make_dfeat_of_agg_feat_through_parent(es):
    """
    The graph looks like this:

        R       C = Customers, the dataframe we're trying to predict on
       / \\     R = Regions, a parent of customers
      S   C     S = Stores, a child of regions
          |
         etc.

    We're trying to calculate a DFeat from C to R on an agg_feat of R on S.
    """
    store_id_feat = IdentityFeature(es["stores"].ww["id"])

    store_count_feat = Feature(
        store_id_feat,
        parent_dataframe_name="régions",
        primitive=Count,
    )

    num_stores_feat = DirectFeature(store_count_feat, child_dataframe_name="customers")

    feature_set = FeatureSet([num_stores_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[num_stores_feat.get_name()].values[0]
    assert v == 3


def test_make_deep_agg_feat_of_dfeat_of_agg_feat(es):
    """
    The graph looks like this (higher implies parent):

          C     C = Customers, the dataframe we're trying to predict on
          |     S = Sessions, a child of Customers
      P   S     L = Log, a child of both Sessions and Products
       \\ /     P = Products, a parent of Log which is not a descendant of Customers
        L

    We're trying to calculate a DFeat from L to P on an agg_feat of P on L, and
    then aggregate it with another agg_feat of C on L.
    """
    log_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="products",
        primitive=Count,
    )

    product_purchases_feat = DirectFeature(log_count_feat, child_dataframe_name="log")

    purchase_popularity = Feature(
        product_purchases_feat,
        parent_dataframe_name="customers",
        primitive=Mean,
    )

    feature_set = FeatureSet([purchase_popularity])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[purchase_popularity.get_name()].values[0]
    assert v == 38.0 / 10.0


def test_deep_agg_feat_chain(es):
    """
    Agg feat of agg feat:
        region.Mean(customer.Count(Log))
    """
    customer_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )

    region_avg_feat = Feature(
        customer_count_feat,
        parent_dataframe_name="régions",
        primitive=Mean,
    )

    feature_set = FeatureSet([region_avg_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array(["United States"]))

    v = df[region_avg_feat.get_name()][0]
    assert v == 17 / 3.0


def test_topn(es):
    topn = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    feature_set = FeatureSet([topn])

    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1, 2]))
    true_results = pd.DataFrame(
        [
            ["toothpaste", "coke zero"],
            ["coke zero", "Haribo sugar-free gummy bears"],
            ["taco clock", np.nan],
        ],
    )
    assert all(name in df.columns for name in topn.get_feature_names())

    for i in range(df.shape[0]):
        true = true_results.loc[i]
        actual = df.loc[i]
        if i == 0:
            # coke zero and toothpaste have the same number of occurrences
            assert set(true.values) == set(actual.values)
        else:
            for i1, i2 in zip(true, actual):
                assert (pd.isnull(i1) and pd.isnull(i2)) or (i1 == i2)


def test_trend(es):
    trend = Feature(
        [Feature(es["log"].ww["value"]), Feature(es["log"].ww["datetime"])],
        parent_dataframe_name="customers",
        primitive=Trend,
    )
    feature_set = FeatureSet([trend])

    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1, 2]))

    true_results = [-0.812730, 4.870378, np.nan]

    np.testing.assert_almost_equal(
        df[trend.get_name()].tolist(),
        true_results,
        decimal=5,
    )


def test_direct_squared(es):
    feature = IdentityFeature(es["log"].ww["value"])
    squared = feature * feature
    feature_set = FeatureSet([feature, squared])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0, 1, 2]))
    for _, row in df.iterrows():
        assert (row.iloc[0] * row.iloc[0]) == row.iloc[1]


def test_agg_empty_child(es):
    customer_count_feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    feature_set = FeatureSet([customer_count_feat])

    # time_last is before the customer had any events, so the child frame is empty
    calculator = FeatureSetCalculator(
        es,
        time_last=datetime(2011, 4, 8),
        feature_set=feature_set,
    )
    df = calculator.run(np.array([0]))

    assert df["COUNT(log)"].iloc[0] == 0


def test_diamond_entityset(diamond_es):
    es = diamond_es

    amount = IdentityFeature(es["transactions"].ww["amount"])
    path = backward_path(es, ["regions", "customers", "transactions"])
    through_customers = AggregationFeature(
        amount,
        "regions",
        primitive=Sum,
        relationship_path=path,
    )
    path = backward_path(es, ["regions", "stores", "transactions"])
    through_stores = AggregationFeature(
        amount,
        "regions",
        primitive=Sum,
        relationship_path=path,
    )

    feature_set = FeatureSet([through_customers, through_stores])
    calculator = FeatureSetCalculator(
        es,
        time_last=datetime(2011, 4, 8),
        feature_set=feature_set,
    )
    df = calculator.run(np.array([0, 1, 2]))

    assert (df["SUM(stores.transactions.amount)"] == [94, 261, 128]).all()
    assert (df["SUM(customers.transactions.amount)"] == [72, 411, 0]).all()


def test_two_relationships_to_single_dataframe(games_es):
    es = games_es
    home_team, away_team = es.relationships
    path = RelationshipPath([(False, home_team)])
    mean_at_home = AggregationFeature(
        Feature(es["games"].ww["home_team_score"]),
        "teams",
        relationship_path=path,
        primitive=Mean,
    )
    path = RelationshipPath([(False, away_team)])
    mean_at_away = AggregationFeature(
        Feature(es["games"].ww["away_team_score"]),
        "teams",
        relationship_path=path,
        primitive=Mean,
    )
    home_team_mean = DirectFeature(mean_at_home, "games", relationship=home_team)
    away_team_mean = DirectFeature(mean_at_away, "games", relationship=away_team)

    feature_set = FeatureSet([home_team_mean, away_team_mean])
    calculator = FeatureSetCalculator(
        es,
        time_last=datetime(2011, 8, 28),
        feature_set=feature_set,
    )
    df = calculator.run(np.array(range(3)))

    assert (df[home_team_mean.get_name()] == [1.5, 1.5, 2.5]).all()
    assert (df[away_team_mean.get_name()] == [1, 0.5, 2]).all()


@pytest.fixture
def parent_child():
    parent_df = pd.DataFrame({"id": [1]})
    child_df = pd.DataFrame(
        {
            "id": [1, 2, 3],
            "parent_id": [1, 1, 1],
            "time_index": pd.date_range(start="1/1/2018", periods=3),
            "value": [10, 5, 2],
            "cat": ["a", "a", "b"],
        },
    ).astype({"cat": "category"})
    return (parent_df, child_df)


def test_empty_child_dataframe(parent_child):
    parent_df, child_df = parent_child
    child_ltypes = {
        "parent_id": Integer,
        "time_index": Datetime,
        "value": Double,
        "cat": Categorical,
    }

    es = EntitySet(id="blah")
    es.add_dataframe(dataframe_name="parent", dataframe=parent_df, index="id")
    es.add_dataframe(
        dataframe_name="child",
        dataframe=child_df,
        index="id",
        time_index="time_index",
        logical_types=child_ltypes,
    )
    es.add_relationship("parent", "id", "child", "parent_id")

    # create regular agg
    count = Feature(
        es["child"].ww["id"],
        parent_dataframe_name="parent",
        primitive=Count,
    )

    # create agg feature that requires multiple arguments
    trend = Feature(
        [Feature(es["child"].ww["value"]), Feature(es["child"].ww["time_index"])],
        parent_dataframe_name="parent",
        primitive=Trend,
    )

    # create multi-output agg feature
    n_most_common = Feature(
        es["child"].ww["cat"],
        parent_dataframe_name="parent",
        primitive=NMostCommon,
    )

    # create aggs with where
    where = Feature(es["child"].ww["value"]) == 1
    count_where = Feature(
        es["child"].ww["id"],
        parent_dataframe_name="parent",
        where=where,
        primitive=Count,
    )
    trend_where = Feature(
        [Feature(es["child"].ww["value"]), Feature(es["child"].ww["time_index"])],
        parent_dataframe_name="parent",
        where=where,
        primitive=Trend,
    )
    n_most_common_where = Feature(
        es["child"].ww["cat"],
        parent_dataframe_name="parent",
        where=where,
        primitive=NMostCommon,
    )

    features = [
        count,
        count_where,
        trend,
        trend_where,
        n_most_common,
        n_most_common_where,
    ]
    data = {
        count.get_name(): pd.Series([0], dtype="Int64"),
        count_where.get_name(): pd.Series([0], dtype="Int64"),
        trend.get_name(): pd.Series([np.nan], dtype="float"),
        trend_where.get_name(): pd.Series([np.nan], dtype="float"),
    }
    for name in n_most_common.get_feature_names():
        data[name] = pd.Series([np.nan], dtype="category")
    for name in n_most_common_where.get_feature_names():
        data[name] = pd.Series([np.nan], dtype="category")

    answer = pd.DataFrame(data)

    # cutoff time before all rows
    fm = calculate_feature_matrix(
        entityset=es,
        features=features,
        cutoff_time=pd.Timestamp("12/31/2017"),
    )

    for column in data.keys():
        pd.testing.assert_series_equal(
            fm[column],
            answer[column],
            check_names=False,
            check_index=False,
        )

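    # cutoff time after all rows, but where clause filters all rows
    data = {
        count_where.get_name(): pd.Series([0], dtype="Int64"),
        trend_where.get_name(): pd.Series([np.nan], dtype="float"),
    }
    for name in n_most_common_where.get_feature_names():
        data[name] = pd.Series([np.nan], dtype="category")
    answer = pd.DataFrame(data)

    # recompute with a cutoff after the last child row so that only the
    # where clause (value == 1 never matches) empties the child data
    fm2 = calculate_feature_matrix(
        entityset=es,
        features=[count_where, trend_where, n_most_common_where],
        cutoff_time=pd.Timestamp("1/4/2018"),
    )

    for column in data.keys():
        pd.testing.assert_series_equal(
            fm2[column],
            answer[column],
            check_names=False,
            check_index=False,
        )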


def test_with_features_built_from_es_metadata(es):
    metadata = es.metadata

    agg_feat = Feature(
        metadata["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )

    feature_set = FeatureSet([agg_feat])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[agg_feat.get_name()].values[0]
    assert v == 10


def test_handles_primitive_function_name_uniqueness(es):
    class SumTimesN(AggregationPrimitive):
        name = "sum_times_n"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def __init__(self, n):
            self.n = n

        def get_function(self):
            def my_function(values):
                return values.sum() * self.n

            return my_function

    # works as expected
    f1 = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=SumTimesN(n=1),
    )
    fm = calculate_feature_matrix(features=[f1], entityset=es)

    value_sum = pd.Series([56, 26, 0])
    assert all(fm[f1.get_name()].sort_index() == value_sum)

    # works as expected
    f2 = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=SumTimesN(n=2),
    )
    fm = calculate_feature_matrix(features=[f2], entityset=es)

    double_value_sum = pd.Series([112, 52, 0])
    assert all(fm[f2.get_name()].sort_index() == double_value_sum)

    # same primitive, same column, different args
    fm = calculate_feature_matrix(features=[f1, f2], entityset=es)

    assert all(fm[f1.get_name()].sort_index() == value_sum)
    assert all(fm[f2.get_name()].sort_index() == double_value_sum)

    # different primitives, same function returned by get_function,
    # different base features
    f3 = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Sum,
    )
    f4 = Feature(
        es["log"].ww["purchased"],
        parent_dataframe_name="customers",
        primitive=NumTrue,
    )
    fm = calculate_feature_matrix(features=[f3, f4], entityset=es)

    purchased_sum = pd.Series([10, 1, 1])
    assert all(fm[f3.get_name()].sort_index() == value_sum)
    assert all(fm[f4.get_name()].sort_index() == purchased_sum)

    # different primitives, same function returned by get_function,
    # same base feature
    class Sum1(AggregationPrimitive):
        """Sums elements of a numeric or boolean feature."""

        name = "sum1"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        stack_on_self = False
        stack_on_exclude = [Count]
        default_value = 0

        def get_function(self):
            return np.sum

    class Sum2(AggregationPrimitive):
        """Sums elements of a numeric or boolean feature."""

        name = "sum2"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        stack_on_self = False
        stack_on_exclude = [Count]
        default_value = 0

        def get_function(self):
            return np.sum

    class Sum3(AggregationPrimitive):
        """Sums elements of a numeric or boolean feature."""

        name = "sum3"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        stack_on_self = False
        stack_on_exclude = [Count]
        default_value = 0

        def get_function(self):
            return np.sum

    f5 = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Sum1,
    )
    f6 = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Sum2,
    )
    f7 = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Sum3,
    )
    fm = calculate_feature_matrix(features=[f5, f6, f7], entityset=es)
    assert all(fm[f5.get_name()].sort_index() == value_sum)
    assert all(fm[f6.get_name()].sort_index() == value_sum)
    assert all(fm[f7.get_name()].sort_index() == value_sum)


def test_returns_order_of_instance_ids(es):
    feature_set = FeatureSet([Feature(es["customers"].ww["age"])])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)

    instance_ids = [0, 1, 2]
    assert list(es["customers"]["id"]) != instance_ids

    df = calculator.run(np.array(instance_ids))

    assert list(df.index) == instance_ids


def test_calls_progress_callback(es):
    # call with all feature types and make sure the progress callback
    # updates sum to 1
    identity = Feature(es["customers"].ww["age"])
    direct = Feature(es["cohorts"].ww["cohort_name"], "customers")
    agg = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    agg_apply = Feature(
        es["log"].ww["datetime"],
        parent_dataframe_name="customers",
        primitive=TimeSinceLast,
    )  # this feature is handled differently than simple features
    trans = Feature(agg, primitive=Negate)
    trans_full = Feature(agg, primitive=CumSum)
    groupby_trans = Feature(
        agg,
        primitive=CumSum,
        groupby=Feature(es["customers"].ww["cohort"]),
    )

    all_features = [
        identity,
        direct,
        agg,
        agg_apply,
        trans,
        trans_full,
        groupby_trans,
    ]

    feature_set = FeatureSet(all_features)
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)

    class MockProgressCallback:
        def __init__(self):
            self.total = 0

        def __call__(self, update):
            self.total += update

    mock_progress_callback = MockProgressCallback()

    instance_ids = [0, 1, 2]
    calculator.run(np.array(instance_ids), mock_progress_callback)

    assert np.isclose(mock_progress_callback.total, 1)

    # testing again with a time_last with no data
    feature_set = FeatureSet(all_features)
    calculator = FeatureSetCalculator(
        es,
        time_last=pd.Timestamp("1950"),
        feature_set=feature_set,
    )

    mock_progress_callback = MockProgressCallback()
    calculator.run(np.array(instance_ids), mock_progress_callback)

    assert np.isclose(mock_progress_callback.total, 1)


# precalculated_features is only used with approximate
def test_precalculated_features(es):
    error_msg = (
        "This primitive should never be used because the features are precalculated"
    )

    class ErrorPrim(AggregationPrimitive):
        """A primitive whose function raises an error."""

        name = "error_prim"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def get_function(self):
            def error(s):
                raise RuntimeError(error_msg)

            return error

    value = Feature(es["log"].ww["value"])
    agg = Feature(value, parent_dataframe_name="sessions", primitive=ErrorPrim)
    agg2 = Feature(agg, parent_dataframe_name="customers", primitive=ErrorPrim)
    direct = Feature(agg2, dataframe_name="sessions")

    # Set up a FeatureSet which knows which features are precalculated.
    precalculated_feature_trie = Trie(default=set, path_constructor=RelationshipPath)
    precalculated_feature_trie.get_node(direct.relationship_path).value.add(
        agg2.unique_name(),
    )
    feature_set = FeatureSet(
        [direct],
        approximate_feature_trie=precalculated_feature_trie,
    )

    # Fake precalculated data.
    values = [0, 1, 2]
    parent_fm = pd.DataFrame({agg2.get_name(): values})
    precalculated_fm_trie = Trie(path_constructor=RelationshipPath)
    precalculated_fm_trie.get_node(direct.relationship_path).value = parent_fm

    calculator = FeatureSetCalculator(
        es,
        feature_set=feature_set,
        precalculated_features=precalculated_fm_trie,
    )

    instance_ids = [0, 2, 3, 5]
    fm = calculator.run(np.array(instance_ids))

    assert list(fm[direct.get_name()]) == [values[0], values[0], values[1], values[2]]

    # Calculating without precalculated features should error.
    with pytest.raises(RuntimeError, match=error_msg):
        FeatureSetCalculator(es, feature_set=FeatureSet([direct])).run(instance_ids)


def test_nunique_nested_with_agg_bug(es):
    """Pandas 2.2.0 has a bug where pd.Series.nunique produces columns with
    the category dtype instead of int64 dtype, causing an error when we attempt
    another aggregation"""
    num_unique_feature = AggregationFeature(
        Feature(es["log"].ww["priority_level"]),
        "sessions",
        primitive=NumUnique,
    )

    mean_nunique_feature = AggregationFeature(
        num_unique_feature,
        "customers",
        primitive=Mean,
    )
    feature_set = FeatureSet([mean_nunique_feature])
    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)
    df = calculator.run(np.array([0]))

    assert df.iloc[0, 0].round(4) == 1.6667


================================================
FILE: featuretools/tests/computational_backend/test_utils.py
================================================
import numpy as np

from featuretools import dfs
from featuretools.computational_backends import replace_inf_values
from featuretools.primitives import DivideByFeature, DivideNumericScalar


def test_replace_inf_values(divide_by_zero_es):
    div_by_scalar = DivideNumericScalar(value=0)
    div_by_feature = DivideByFeature(value=1)
    div_by_feature_neg = DivideByFeature(value=-1)
    for primitive in [
        "divide_numeric",
        div_by_scalar,
        div_by_feature,
        div_by_feature_neg,
    ]:
        fm, _ = dfs(
            entityset=divide_by_zero_es,
            target_dataframe_name="zero",
            trans_primitives=[primitive],
            max_depth=1,
        )
        assert np.inf in fm.values or -np.inf in fm.values
        replaced_fm = replace_inf_values(fm)
        assert np.inf not in replaced_fm.values
        assert -np.inf not in replaced_fm.values

        custom_value_fm = replace_inf_values(fm, replacement_value="custom_val")
        assert np.inf not in custom_value_fm.values
        assert -np.inf not in custom_value_fm.values
        assert "custom_val" in custom_value_fm.values


def test_replace_inf_values_specify_cols(divide_by_zero_es):
    div_by_scalar = DivideNumericScalar(value=0)
    fm, _ = dfs(
        entityset=divide_by_zero_es,
        target_dataframe_name="zero",
        trans_primitives=[div_by_scalar],
        max_depth=1,
    )

    assert np.inf in fm["col1 / 0"].values
    replaced_fm = replace_inf_values(fm, columns=["col1 / 0"])
    assert np.inf not in replaced_fm["col1 / 0"].values
    assert np.inf in replaced_fm["col2 / 0"].values


================================================
FILE: featuretools/tests/config_tests/__init__.py
================================================


================================================
FILE: featuretools/tests/config_tests/test_config.py
================================================
from featuretools import config


def test_get_default_config_does_not_change():
    old_config = config.get_all()

    key = "primitive_data_folder"
    value = "This is an example string"
    config.set({key: value})
    config.set_to_default()

    assert config.get(key) != value

    config.set(old_config)


def test_set_and_get_config():
    key = "primitive_data_folder"
    old_value = config.get(key)
    value = "This is an example string"

    config.set({key: value})
    assert config.get(key) == value

    config.set({key: old_value})


def test_get_all():
    assert config.get_all() == config._data


================================================
FILE: featuretools/tests/conftest.py
================================================
import contextlib
import copy
import os

import composeml as cp
import numpy as np
import pandas as pd
import pytest
from packaging.version import parse
from woodwork.column_schema import ColumnSchema

from featuretools import EntitySet, demo
from featuretools.primitives import AggregationPrimitive, TransformPrimitive
from featuretools.tests.testing_utils import make_ecommerce_entityset


@pytest.fixture()
def dask_cluster():
    distributed = pytest.importorskip(
        "distributed",
        reason="Dask not installed, skipping",
    )
    if distributed:
        with distributed.LocalCluster() as cluster:
            yield cluster


@pytest.fixture()
def three_worker_dask_cluster():
    distributed = pytest.importorskip(
        "distributed",
        reason="Dask not installed, skipping",
    )
    if distributed:
        with distributed.LocalCluster(n_workers=3) as cluster:
            yield cluster


@pytest.fixture(scope="session")
def make_es():
    return make_ecommerce_entityset()


@pytest.fixture(scope="session")
def make_int_es():
    return make_ecommerce_entityset(with_integer_time_index=True)


@pytest.fixture
def es(make_es):
    return copy.deepcopy(make_es)


@pytest.fixture
def int_es(make_int_es):
    return copy.deepcopy(make_int_es)


@pytest.fixture
def latlong_df():
    df = pd.DataFrame({"idx": [0, 1, 2], "latLong": [pd.NA, (1, 2), (pd.NA, pd.NA)]})
    return df


@pytest.fixture
def diamond_es():
    countries_df = pd.DataFrame({"id": range(2), "name": ["US", "Canada"]})
    regions_df = pd.DataFrame(
        {
            "id": range(3),
            "country_id": [0, 0, 1],
            "name": ["Northeast", "South", "Quebec"],
        },
    ).astype({"name": "category"})
    stores_df = pd.DataFrame(
        {
            "id": range(5),
            "region_id": [0, 1, 2, 2, 1],
            "square_ft": [2000, 3000, 1500, 2500, 2700],
        },
    )
    customers_df = pd.DataFrame(
        {
            "id": range(5),
            "region_id": [1, 0, 0, 1, 1],
            "name": ["A", "B", "C", "D", "E"],
        },
    )
    transactions_df = pd.DataFrame(
        {
            "id": range(8),
            "store_id": [4, 4, 2, 3, 4, 0, 1, 1],
            "customer_id": [3, 0, 2, 4, 3, 3, 2, 3],
            "amount": [100, 40, 45, 83, 13, 94, 27, 81],
        },
    )

    dataframes = {
        "countries": (countries_df, "id"),
        "regions": (regions_df, "id"),
        "stores": (stores_df, "id"),
        "customers": (customers_df, "id"),
        "transactions": (transactions_df, "id"),
    }
    relationships = [
        ("countries", "id", "regions", "country_id"),
        ("regions", "id", "stores", "region_id"),
        ("regions", "id", "customers", "region_id"),
        ("stores", "id", "transactions", "store_id"),
        ("customers", "id", "transactions", "customer_id"),
    ]
    return EntitySet(
        id="ecommerce_diamond",
        dataframes=dataframes,
        relationships=relationships,
    )


@pytest.fixture
def default_value_es():
    transactions = pd.DataFrame(
        {"id": [1, 2, 3, 4], "session_id": ["a", "a", "b", "c"], "value": [1, 1, 1, 1]},
    )

    sessions = pd.DataFrame({"id": ["a", "b"]})

    es = EntitySet()
    es.add_dataframe(dataframe_name="transactions", dataframe=transactions, index="id")
    es.add_dataframe(dataframe_name="sessions", dataframe=sessions, index="id")

    es.add_relationship("sessions", "id", "transactions", "session_id")
    return es


@pytest.fixture
def home_games_es():
    teams = pd.DataFrame({"id": range(3), "name": ["Breakers", "Spirit", "Thorns"]})
    games = pd.DataFrame(
        {
            "id": range(5),
            "home_team_id": [2, 2, 1, 0, 1],
            "away_team_id": [1, 0, 2, 1, 0],
            "home_team_score": [3, 0, 1, 0, 4],
            "away_team_score": [2, 1, 2, 0, 0],
        },
    )
    dataframes = {"teams": (teams, "id"), "games": (games, "id")}
    relationships = [("teams", "id", "games", "home_team_id")]
    return EntitySet(dataframes=dataframes, relationships=relationships)


@pytest.fixture
def games_es(home_games_es):
    return home_games_es.add_relationship("teams", "id", "games", "away_team_id")


@pytest.fixture
def mock_customer():
    return demo.load_mock_customer(return_entityset=True, random_seed=0)


@pytest.fixture
def lt(es):
    def label_func(df):
        return df["value"].sum() > 10

    kwargs = {
        "time_index": "datetime",
        "labeling_function": label_func,
        "window_size": "1m",
    }
    if parse(cp.__version__) >= parse("0.10.0"):
        kwargs["target_dataframe_index"] = "id"
    else:
        kwargs["target_dataframe_name"] = "id"  # pragma: no cover

    lm = cp.LabelMaker(**kwargs)

    df = es["log"]
    labels = lm.search(df, num_examples_per_instance=-1)
    labels = labels.rename(columns={"cutoff_time": "time"})
    return labels


@pytest.fixture
def dataframes():
    cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "card_id": [1, 2, 1, 3, 4, 5],
            "transaction_time": [10, 12, 13, 20, 21, 20],
            "fraud": [True, False, False, False, True, True],
        },
    )
    dataframes = {
        "cards": (cards_df, "id"),
        "transactions": (transactions_df, "id", "transaction_time"),
    }
    return dataframes


@pytest.fixture
def relationships():
    return [("cards", "id", "transactions", "card_id")]


@pytest.fixture
def transform_es():
    # Create dataframe
    df = pd.DataFrame(
        {
            "a": [14, 12, 10],
            "b": [False, False, True],
            "b1": [True, True, False],
            "b12": [4, 5, 6],
            "P": [10, 15, 12],
        },
    )
    es = EntitySet(id="test")
    # Add dataframe to entityset
    es.add_dataframe(
        dataframe_name="first",
        dataframe=df,
        index="index",
        make_index=True,
    )

    return es


@pytest.fixture
def divide_by_zero_es():
    df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3],
            "col1": [1, 0, -3, 4],
            "col2": [0, 0, 0, 4],
        },
    )
    return EntitySet("data", {"zero": (df, "id", None)})


@pytest.fixture
def window_series():
    return pd.Series(
        range(20),
        index=pd.date_range(start="2020-01-01", end="2020-01-20"),
    )


@pytest.fixture
def window_date_range():
    return pd.date_range(start="2022-11-1", end="2022-11-5", periods=30)


@pytest.fixture
def rolling_outlier_series():
    return pd.Series(
        [0] * 4 + [10] + [0] * 4 + [10] + [0] * 5,
        index=pd.date_range(start="2020-01-01", end="2020-01-15", periods=15),
    )


@pytest.fixture
def postal_code_dataframe():
    df = pd.DataFrame(
        {
            "string_dtype": pd.Series(["90210", "60018", "10010", "92304-4201"]),
            "int_dtype": pd.Series([10000, 20000, 30000]).astype("category"),
            "has_nulls": pd.Series([np.nan, 20000, 30000]).astype("category"),
        },
    )
    df.ww.init(
        logical_types={
            "string_dtype": "PostalCode",
            "int_dtype": "PostalCode",
            "has_nulls": "PostalCode",
        },
    )
    return df


def create_test_credentials(test_path):
    with open(test_path, "w+") as f:
        f.write("[test]\n")
        f.write("aws_access_key_id=AKIAIOSFODNN7EXAMPLE\n")
        f.write("aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\n")


def create_test_config(test_path_config):
    with open(test_path_config, "w+") as f:
        f.write("[profile test]\n")
        f.write("region=us-east-2\n")
        f.write("output=text\n")


@pytest.fixture
def setup_test_profile(monkeypatch, tmp_path):
    cache = tmp_path.joinpath(".cache")
    cache.mkdir()
    test_path = str(cache.joinpath("test_credentials"))
    test_path_config = str(cache.joinpath("test_config"))
    monkeypatch.setenv("AWS_SHARED_CREDENTIALS_FILE", test_path)
    monkeypatch.setenv("AWS_CONFIG_FILE", test_path_config)
    monkeypatch.delenv("AWS_ACCESS_KEY_ID", raising=False)
    monkeypatch.delenv("AWS_SECRET_ACCESS_KEY", raising=False)
    monkeypatch.setenv("AWS_PROFILE", "test")

    with contextlib.suppress(OSError):
        os.remove(test_path)
        os.remove(test_path_config)  # pragma: no cover

    create_test_credentials(test_path)
    create_test_config(test_path_config)
    yield
    os.remove(test_path)
    os.remove(test_path_config)


@pytest.fixture
def test_aggregation_primitive():
    class TestAgg(AggregationPrimitive):
        name = "test"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        stack_on = []

    return TestAgg


@pytest.fixture
def test_transform_primitive():
    class TestTransform(TransformPrimitive):
        name = "test"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        stack_on = []

    return TestTransform


@pytest.fixture
def strings_that_have_triggered_errors_before():
    return [
        "    ",
        '"This Borderlands game here"" is the perfect conclusion to the ""Borderlands 3"" line, which focuses on the fans ""favorite character and gives the players the opportunity to close for a long time some very important questions about\'s character and the memorable scenery with which the players interact.',
    ]


================================================
FILE: featuretools/tests/demo_tests/__init__.py
================================================


================================================
FILE: featuretools/tests/demo_tests/test_demo_data.py
================================================
import urllib.request

import pandas as pd
import pytest

from featuretools import EntitySet
from featuretools.demo import load_flight, load_mock_customer, load_retail, load_weather


@pytest.fixture(autouse=True)
def set_testing_headers():
    opener = urllib.request.build_opener()
    opener.addheaders = [("Testing", "True")]
    urllib.request.install_opener(opener)


def test_load_retail_diff():
    nrows = 10
    es_first = load_retail(nrows=nrows)
    assert isinstance(es_first, EntitySet)
    assert es_first["order_products"].shape[0] == nrows
    nrows_second = 11
    es_second = load_retail(nrows=nrows_second)
    assert es_second["order_products"].shape[0] == nrows_second


def test_mock_customer():
    n_customers = 4
    n_products = 3
    n_sessions = 30
    n_transactions = 400
    es = load_mock_customer(
        n_customers=n_customers,
        n_products=n_products,
        n_sessions=n_sessions,
        n_transactions=n_transactions,
        random_seed=0,
        return_entityset=True,
    )
    assert isinstance(es, EntitySet)
    df_names = [df.ww.name for df in es.dataframes]
    expected_names = ["transactions", "products", "sessions", "customers"]
    assert set(expected_names) == set(df_names)
    assert len(es["customers"]) == 4
    assert len(es["products"]) == 3
    assert len(es["sessions"]) == 30
    assert len(es["transactions"]) == 400


def test_load_flight():
    es = load_flight(
        month_filter=[1],
        categorical_filter={"origin_city": ["Charlotte, NC"]},
        return_single_table=False,
        nrows=1000,
    )
    assert isinstance(es, EntitySet)
    dataframe_names = ["airports", "flights", "trip_logs", "airlines"]
    realvals = [(11, 3), (13, 9), (103, 21), (1, 1)]
    for i, name in enumerate(dataframe_names):
        assert es[name].shape == realvals[i]


def test_weather():
    es = load_weather()
    assert isinstance(es, EntitySet)
    dataframe_names = ["temperatures"]
    realvals = [(3650, 3)]
    for i, name in enumerate(dataframe_names):
        assert es[name].shape == realvals[i]
    es = load_weather(return_single_table=True)
    assert isinstance(es, pd.DataFrame)


================================================
FILE: featuretools/tests/entityset_tests/__init__.py
================================================


================================================
FILE: featuretools/tests/entityset_tests/test_es.py
================================================
import copy
import logging
import pickle
import re
from datetime import datetime
from unittest.mock import patch

import numpy as np
import pandas as pd
import pytest
from woodwork.logical_types import (
    URL,
    Boolean,
    Categorical,
    CountryCode,
    Datetime,
    Double,
    EmailAddress,
    Integer,
    LatLong,
    NaturalLanguage,
    Ordinal,
    PostalCode,
    SubRegionCode,
)

from featuretools import Relationship
from featuretools.demo import load_retail
from featuretools.entityset import EntitySet
from featuretools.entityset.entityset import LTI_COLUMN_NAME, WW_SCHEMA_KEY
from featuretools.tests.testing_utils import get_df_tags


def test_normalize_time_index_as_additional_column(es):
    error_text = "Not moving signup_date as it is the base time index column. Perhaps, move the column to the copy_columns."
    with pytest.raises(ValueError, match=error_text):
        assert "signup_date" in es["customers"].columns
        es.normalize_dataframe(
            base_dataframe_name="customers",
            new_dataframe_name="cancellations",
            index="cancel_reason",
            make_time_index="signup_date",
            additional_columns=["signup_date"],
            copy_columns=[],
        )


def test_normalize_time_index_as_copy_column(es):
    assert "signup_date" in es["customers"].columns
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancellations",
        index="cancel_reason",
        make_time_index="signup_date",
        additional_columns=[],
        copy_columns=["signup_date"],
    )

    assert "signup_date" in es["customers"].columns
    assert es["customers"].ww.time_index == "signup_date"
    assert "signup_date" in es["cancellations"].columns
    assert es["cancellations"].ww.time_index == "signup_date"


def test_normalize_time_index_as_copy_column_new_time_index(es):
    assert "signup_date" in es["customers"].columns
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancellations",
        index="cancel_reason",
        make_time_index=True,
        additional_columns=[],
        copy_columns=["signup_date"],
    )

    assert "signup_date" in es["customers"].columns
    assert es["customers"].ww.time_index == "signup_date"
    assert "first_customers_time" in es["cancellations"].columns
    assert "signup_date" not in es["cancellations"].columns
    assert es["cancellations"].ww.time_index == "first_customers_time"


def test_normalize_time_index_as_copy_column_no_time_index(es):
    assert "signup_date" in es["customers"].columns
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancellations",
        index="cancel_reason",
        make_time_index=False,
        additional_columns=[],
        copy_columns=["signup_date"],
    )

    assert "signup_date" in es["customers"].columns
    assert es["customers"].ww.time_index == "signup_date"
    assert "signup_date" in es["cancellations"].columns
    assert es["cancellations"].ww.time_index is None


def test_cannot_re_add_relationships_that_already_exists(es):
    warn_text = "Not adding duplicate relationship: " + str(es.relationships[0])
    before_len = len(es.relationships)
    rel = es.relationships[0]
    with pytest.warns(UserWarning, match=warn_text):
        es.add_relationship(relationship=rel)
    with pytest.warns(UserWarning, match=warn_text):
        es.add_relationship(
            rel._parent_dataframe_name,
            rel._parent_column_name,
            rel._child_dataframe_name,
            rel._child_column_name,
        )
    after_len = len(es.relationships)
    assert before_len == after_len


def test_add_relationships_convert_type(es):
    for r in es.relationships:
        parent_df = es[r.parent_dataframe.ww.name]
        child_df = es[r.child_dataframe.ww.name]
        assert parent_df.ww.index == r._parent_column_name
        assert "foreign_key" in r.child_column.ww.semantic_tags
        assert str(parent_df[r._parent_column_name].dtype) == str(
            child_df[r._child_column_name].dtype,
        )


def test_add_relationship_diff_param_logical_types(es):
    ordinal_1 = Ordinal(order=[0, 1, 2, 3, 4, 5, 6])
    ordinal_2 = Ordinal(order=[0, 1, 2, 3, 4, 5])
    es["sessions"].ww.set_types(logical_types={"id": ordinal_1})
    log_2_df = es["log"].copy()
    log_logical_types = {
        "id": Integer,
        "session_id": ordinal_2,
        "product_id": Categorical(),
        "datetime": Datetime,
        "value": Double,
        "value_2": Double,
        "latlong": LatLong,
        "latlong2": LatLong,
        "zipcode": PostalCode,
        "countrycode": CountryCode,
        "subregioncode": SubRegionCode,
        "value_many_nans": Double,
        "priority_level": Ordinal(order=[0, 1, 2]),
        "purchased": Boolean,
        "comments": NaturalLanguage,
        "url": URL,
        "email_address": EmailAddress,
    }
    log_semantic_tags = {"session_id": "foreign_key", "product_id": "foreign_key"}
    assert set(log_logical_types) == set(log_2_df.columns)
    es.add_dataframe(
        dataframe_name="log2",
        dataframe=log_2_df,
        index="id",
        logical_types=log_logical_types,
        semantic_tags=log_semantic_tags,
        time_index="datetime",
    )
    assert "log2" in es.dataframe_dict
    assert es["log2"].ww.schema is not None
    assert isinstance(es["log2"].ww.logical_types["session_id"], Ordinal)
    assert isinstance(es["sessions"].ww.logical_types["id"], Ordinal)
    assert (
        es["sessions"].ww.logical_types["id"]
        != es["log2"].ww.logical_types["session_id"]
    )

    warning_text = "Changing child logical type to match parent."
    with pytest.warns(UserWarning, match=warning_text):
        es.add_relationship("sessions", "id", "log2", "session_id")
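    # The child logical type, including its parameters, now matches the parent.
    assert (
        es["sessions"].ww.logical_types["id"]
        == es["log2"].ww.logical_types["session_id"]
    )
    assert isinstance(es["log2"].ww.logical_types["product_id"], Categorical)
    assert isinstance(es["products"].ww.logical_types["id"], Categorical)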


def test_add_relationship_different_logical_types_same_dtype(es):
    log_2_df = es["log"].copy()
    log_logical_types = {
        "id": Integer,
        "session_id": Integer,
        "product_id": CountryCode,
        "datetime": Datetime,
        "value": Double,
        "value_2": Double,
        "latlong": LatLong,
        "latlong2": LatLong,
        "zipcode": PostalCode,
        "countrycode": CountryCode,
        "subregioncode": SubRegionCode,
        "value_many_nans": Double,
        "priority_level": Ordinal(order=[0, 1, 2]),
        "purchased": Boolean,
        "comments": NaturalLanguage,
        "url": URL,
        "email_address": EmailAddress,
    }
    log_semantic_tags = {"session_id": "foreign_key", "product_id": "foreign_key"}
    assert set(log_logical_types) == set(log_2_df.columns)
    es.add_dataframe(
        dataframe_name="log2",
        dataframe=log_2_df,
        index="id",
        logical_types=log_logical_types,
        semantic_tags=log_semantic_tags,
        time_index="datetime",
    )
    assert "log2" in es.dataframe_dict
    assert es["log2"].ww.schema is not None
    assert isinstance(es["log2"].ww.logical_types["product_id"], CountryCode)
    assert isinstance(es["products"].ww.logical_types["id"], Categorical)

    warning_text = "Logical type CountryCode for child column product_id does not match parent column id logical type Categorical. Changing child logical type to match parent."
    with pytest.warns(UserWarning, match=warning_text):
        es.add_relationship("products", "id", "log2", "product_id")
    assert isinstance(es["log2"].ww.logical_types["product_id"], Categorical)
    assert isinstance(es["products"].ww.logical_types["id"], Categorical)
    assert "foreign_key" in es["log2"].ww.semantic_tags["product_id"]


def test_add_relationship_different_compatible_dtypes(es):
    log_2_df = es["log"].copy()
    log_logical_types = {
        "id": Integer,
        "session_id": Datetime,
        "product_id": Categorical,
        "datetime": Datetime,
        "value": Double,
        "value_2": Double,
        "latlong": LatLong,
        "latlong2": LatLong,
        "zipcode": PostalCode,
        "countrycode": CountryCode,
        "subregioncode": SubRegionCode,
        "value_many_nans": Double,
        "priority_level": Ordinal(order=[0, 1, 2]),
        "purchased": Boolean,
        "comments": NaturalLanguage,
        "url": URL,
        "email_address": EmailAddress,
    }
    log_semantic_tags = {"session_id": "foreign_key", "product_id": "foreign_key"}
    assert set(log_logical_types) == set(log_2_df.columns)
    es.add_dataframe(
        dataframe_name="log2",
        dataframe=log_2_df,
        index="id",
        logical_types=log_logical_types,
        semantic_tags=log_semantic_tags,
        time_index="datetime",
    )
    assert "log2" in es.dataframe_dict
    assert es["log2"].ww.schema is not None
    assert isinstance(es["log2"].ww.logical_types["session_id"], Datetime)
    assert isinstance(es["customers"].ww.logical_types["id"], Integer)

    warning_text = "Logical type Datetime for child column session_id does not match parent column id logical type Integer. Changing child logical type to match parent."
    with pytest.warns(UserWarning, match=warning_text):
        es.add_relationship("customers", "id", "log2", "session_id")
    assert isinstance(es["log2"].ww.logical_types["session_id"], Integer)
    assert isinstance(es["customers"].ww.logical_types["id"], Integer)


def test_add_relationship_errors_child_v_index(es):
    new_df = es["log"].ww.copy()
    new_df.ww._schema.name = "log2"
    es.add_dataframe(dataframe=new_df)

    to_match = "Unable to add relationship because child column 'id' in 'log2' is also its index"
    with pytest.raises(ValueError, match=to_match):
        es.add_relationship("log", "id", "log2", "id")


def test_add_relationship_empty_child_convert_dtype(es):
    relationship = Relationship(es, "sessions", "id", "log", "session_id")
    empty_log_df = pd.DataFrame(columns=es["log"].columns)

    es.add_dataframe(empty_log_df, "log")

    assert len(es["log"]) == 0
    # session_id will be Unknown logical type with dtype string
    assert es["log"]["session_id"].dtype == "string"

    es.relationships.remove(relationship)
    assert relationship not in es.relationships

    es.add_relationship(relationship=relationship)
    assert es["log"]["session_id"].dtype == "int64"


def test_add_relationship_with_relationship_object(es):
    relationship = Relationship(es, "sessions", "id", "log", "session_id")
    es.add_relationship(relationship=relationship)
    assert relationship in es.relationships


def test_add_relationships_with_relationship_object(es):
    relationships = [Relationship(es, "sessions", "id", "log", "session_id")]
    es.add_relationships(relationships)
    assert relationships[0] in es.relationships


def test_add_relationship_error(es):
    relationship = Relationship(es, "sessions", "id", "log", "session_id")
    error_message = (
        "Cannot specify dataframe and column name values and also supply a Relationship"
    )
    with pytest.raises(ValueError, match=error_message):
        es.add_relationship(parent_dataframe_name="sessions", relationship=relationship)


def test_query_by_values_returns_rows_in_given_order():
    data = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5],
            "value": ["a", "c", "b", "a", "a"],
            "time": [1000, 2000, 3000, 4000, 5000],
        },
    )

    es = EntitySet()
    es = es.add_dataframe(
        dataframe=data,
        dataframe_name="test",
        index="id",
        time_index="time",
        logical_types={"value": "Categorical"},
    )
    query = es.query_by_values("test", ["b", "a"], column_name="value")
    # Matching rows keep the dataframe's own order, not the order of the queried values.
    assert np.array_equal(query["id"], [1, 3, 4, 5])


def test_query_by_values_secondary_time_index(es):
    end = np.datetime64(datetime(2011, 10, 1))
    all_instances = [0, 1, 2]
    result = es.query_by_values("customers", all_instances, time_last=end)

    for col in ["cancel_date", "cancel_reason"]:
        nulls = result.loc[all_instances][col].isnull() == [False, True, True]
        assert nulls.all(), f"Some instance has data it shouldn't for column {col}"


def test_query_by_id(es):
    df = es.query_by_values("log", instance_vals=[0])
    assert df["id"].values[0] == 0


def test_query_by_single_value(es):
    df = es.query_by_values("log", instance_vals=0)
    assert df["id"].values[0] == 0


def test_query_by_df(es):
    instance_df = pd.DataFrame({"id": [1, 3], "vals": [0, 1]})
    df = es.query_by_values("log", instance_vals=instance_df)

    assert np.array_equal(df["id"], [1, 3])


def test_query_by_id_with_time(es):
    df = es.query_by_values(
        dataframe_name="log",
        instance_vals=[0, 1, 2, 3, 4],
        time_last=datetime(2011, 4, 9, 10, 30, 2 * 6),
    )

    assert list(df["id"].values) == [0, 1, 2]


def test_query_by_column_with_time(es):
    df = es.query_by_values(
        dataframe_name="log",
        instance_vals=[0, 1, 2],
        column_name="session_id",
        time_last=datetime(2011, 4, 9, 10, 50, 0),
    )

    true_values = [i * 5 for i in range(5)] + [i * 1 for i in range(4)] + [0]

    assert list(df["id"].values) == list(range(10))
    assert list(df["value"].values) == true_values


def test_query_by_column_with_no_lti_and_training_window(es):
    match = (
        "Using training_window but last_time_index is not set for dataframe customers"
    )
    with pytest.warns(UserWarning, match=match):
        df = es.query_by_values(
            dataframe_name="customers",
            instance_vals=[0, 1, 2],
            column_name="cohort",
            time_last=datetime(2011, 4, 11),
            training_window="3d",
        )

    assert list(df["id"].values) == [1]
    assert list(df["age"].values) == [25]


def test_query_by_column_with_lti_and_training_window(es):
    es.add_last_time_indexes()
    df = es.query_by_values(
        dataframe_name="customers",
        instance_vals=[0, 1, 2],
        column_name="cohort",
        time_last=datetime(2011, 4, 11),
        training_window="3d",
    )
    df = df.reset_index(drop=True).sort_values("id")
    assert list(df["id"].values) == [0, 1, 2]
    assert list(df["age"].values) == [33, 25, 56]


def test_query_by_indexed_column(es):
    df = es.query_by_values(
        dataframe_name="log",
        instance_vals=["taco clock"],
        column_name="product_id",
    )
    df = df.reset_index(drop=True).sort_values("id")
    assert list(df["id"].values) == [15, 16]


@pytest.fixture
def df():
    return pd.DataFrame({"id": [0, 1, 2], "category": ["a", "b", "c"]})


def test_check_columns_and_dataframe(df):
    # logical_types keys match the dataframe's columns
    logical_types = {"id": Integer, "category": Categorical}
    es = EntitySet(id="test")
    es.add_dataframe(
        df,
        dataframe_name="test_dataframe",
        index="id",
        logical_types=logical_types,
    )
    assert isinstance(
        es.dataframe_dict["test_dataframe"].ww.logical_types["category"],
        Categorical,
    )
    assert es.dataframe_dict["test_dataframe"].ww.semantic_tags["category"] == {
        "category",
    }


def test_make_index_any_location(df):
    logical_types = {"id": Integer, "category": Categorical}

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="id1",
        make_index=True,
        logical_types=logical_types,
        dataframe=df,
    )
    assert es.dataframe_dict["test_dataframe"].columns[0] == "id1"
    assert es.dataframe_dict["test_dataframe"].ww.index == "id1"


def test_replace_dataframe_and_create_index(es):
    df = pd.DataFrame({"ints": [3, 4, 5], "category": ["a", "b", "a"]})
    final_df = df.copy()
    final_df["id"] = [0, 1, 2]
    needs_idx_df = df.copy()

    logical_types = {"ints": Integer, "category": Categorical}
    es.add_dataframe(
        dataframe=df,
        dataframe_name="test_df",
        index="id",
        make_index=True,
        logical_types=logical_types,
    )

    assert es["test_df"].ww.index == "id"

    # DataFrame that needs the index column added
    assert "id" not in needs_idx_df.columns
    es.replace_dataframe("test_df", needs_idx_df)

    assert es["test_df"].ww.index == "id"
    df = es["test_df"].sort_values(by="id")
    assert all(df["id"] == final_df["id"])
    assert all(df["ints"] == final_df["ints"])


def test_replace_dataframe_created_index_present(es):
    df = pd.DataFrame({"ints": [3, 4, 5], "category": ["a", "b", "a"]})

    logical_types = {"ints": Integer, "category": Categorical}
    es.add_dataframe(
        dataframe=df,
        dataframe_name="test_df",
        index="id",
        make_index=True,
        logical_types=logical_types,
    )

    # DataFrame that already has the index column
    has_idx_df = es["test_df"].replace({0: 100})
    has_idx_df.set_index("id", drop=False, inplace=True)

    assert "id" in has_idx_df.columns

    es.replace_dataframe("test_df", has_idx_df)
    assert es["test_df"].ww.index == "id"
    df = es["test_df"].sort_values(by="ints")
    assert all(df["id"] == [100, 1, 2])


def test_index_any_location(df):
    logical_types = {"id": Integer, "category": Categorical}

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="category",
        logical_types=logical_types,
        dataframe=df,
    )
    assert es.dataframe_dict["test_dataframe"].columns[1] == "category"
    assert es.dataframe_dict["test_dataframe"].ww.index == "category"


def test_extra_column_type(df):
    # logical_types names a column that is not present in the dataframe
    logical_types = {"id": Integer, "category": Categorical, "category2": Categorical}

    error_text = re.escape(
        "logical_types contains columns that are not present in dataframe: ['category2']",
    )
    with pytest.raises(LookupError, match=error_text):
        es = EntitySet(id="test")
        es.add_dataframe(
            dataframe_name="test_dataframe",
            index="id",
            logical_types=logical_types,
            dataframe=df,
        )


def test_add_parent_not_index_column(es):
    error_text = "Parent column 'language' is not the index of dataframe régions"
    with pytest.raises(AttributeError, match=error_text):
        es.add_relationship("régions", "language", "customers", "région_id")


@pytest.fixture
def df2():
    return pd.DataFrame({"category": [1, 2, 3], "category2": ["1", "2", "3"]})


def test_none_index(df2):
    es = EntitySet(id="test")

    copy_df = df2.copy()
    copy_df.ww.init(name="test_dataframe")
    error_msg = "Cannot add Woodwork DataFrame to EntitySet without index"
    with pytest.raises(ValueError, match=error_msg):
        es.add_dataframe(dataframe=copy_df)

    warn_text = (
        "Using first column as index. To change this, specify the index parameter"
    )
    with pytest.warns(UserWarning, match=warn_text):
        es.add_dataframe(
            dataframe_name="test_dataframe",
            logical_types={"category": "Categorical"},
            dataframe=df2,
        )
    assert es["test_dataframe"].ww.index == "category"
    assert es["test_dataframe"].ww.semantic_tags["category"] == {"index"}
    assert isinstance(es["test_dataframe"].ww.logical_types["category"], Categorical)


@pytest.fixture
def df3():
    return pd.DataFrame({"category": [1, 2, 3]})


def test_unknown_index(df3):
    warn_text = "index id not found in dataframe, creating new integer column"
    es = EntitySet(id="test")
    with pytest.warns(UserWarning, match=warn_text):
        es.add_dataframe(
            dataframe_name="test_dataframe",
            dataframe=df3,
            index="id",
            logical_types={"category": "Categorical"},
        )
    assert es["test_dataframe"].ww.index == "id"
    assert list(es["test_dataframe"]["id"]) == list(
        range(3),
    )


def test_doesnt_remake_index(df):
    logical_types = {"id": "Integer", "category": "Categorical"}
    error_text = "Cannot make index: column with name id already present"
    with pytest.raises(RuntimeError, match=error_text):
        es = EntitySet(id="test")
        es.add_dataframe(
            dataframe_name="test_dataframe",
            index="id",
            make_index=True,
            dataframe=df,
            logical_types=logical_types,
        )


def test_bad_time_index_column(df3):
    logical_types = {"category": "Categorical"}
    error_text = "Specified time index column `time` not found in dataframe"
    with pytest.raises(LookupError, match=error_text):
        es = EntitySet(id="test")
        es.add_dataframe(
            dataframe_name="test_dataframe",
            dataframe=df3,
            index="category",
            time_index="time",
            logical_types=logical_types,
        )


@pytest.fixture
def df4():
    df = pd.DataFrame(
        {
            "id": [0, 1, 2],
            "category": ["a", "b", "a"],
            "category_int": [1, 2, 3],
            "ints": ["1", "2", "3"],
            "floats": ["1", "2", "3.0"],
        },
    )
    df["category_int"] = df["category_int"].astype("category")
    return df


def test_converts_dtype_on_init(df4):
    logical_types = {"id": Integer, "ints": Integer, "floats": Double}
    es = EntitySet(id="test")
    df4.ww.init(name="test_dataframe", index="id", logical_types=logical_types)
    es.add_dataframe(dataframe=df4)

    df = es["test_dataframe"]
    assert df["ints"].dtype.name == "int64"
    assert df["floats"].dtype.name == "float64"

    # this logical type is inferred from the pandas category dtype
    assert isinstance(df.ww.logical_types["category_int"], Categorical)


def test_converts_dtype_after_init(df4):
    category_dtype = "category"

    df4["category"] = df4["category"].astype(category_dtype)

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="id",
        dataframe=df4,
        logical_types=None,
    )
    df = es["test_dataframe"]

    df.ww.set_types(logical_types={"ints": "Integer"})
    assert isinstance(df.ww.logical_types["ints"], Integer)
    assert df["ints"].dtype == "int64"

    df.ww.set_types(logical_types={"ints": "Categorical"})
    assert isinstance(df.ww.logical_types["ints"], Categorical)
    assert df["ints"].dtype == category_dtype

    df.ww.set_types(logical_types={"ints": Ordinal(order=[1, 2, 3])})
    assert df.ww.logical_types["ints"] == Ordinal(order=[1, 2, 3])
    assert df["ints"].dtype == category_dtype

    df.ww.set_types(logical_types={"ints": "NaturalLanguage"})
    assert isinstance(df.ww.logical_types["ints"], NaturalLanguage)
    assert df["ints"].dtype == "string"


@pytest.fixture
def datetime1():
    times = pd.date_range("1/1/2011", periods=3, freq="H")
    time_strs = times.strftime("%Y-%m-%d")
    return pd.DataFrame({"id": [0, 1, 2], "time": time_strs})


def test_converts_datetime(datetime1):
    # Strings convert to datetimes correctly when logical types are specified.
    # This test fails without defining logical types, because the EntitySet
    # infers that the time column should be a numeric type.
    logical_types = {"id": Integer, "time": Datetime}

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="id",
        time_index="time",
        logical_types=logical_types,
        dataframe=datetime1,
    )
    pd_col = es["test_dataframe"]["time"]
    assert isinstance(es["test_dataframe"].ww.logical_types["time"], Datetime)
    assert isinstance(pd_col[0], pd.Timestamp)


@pytest.fixture
def datetime2():
    datetime_format = "%d-%m-%Y"
    actual = pd.Timestamp("Jan 2, 2011")
    time_strs = [actual.strftime(datetime_format)] * 3
    return pd.DataFrame(
        {"id": [0, 1, 2], "time_format": time_strs, "time_no_format": time_strs},
    )


def test_handles_datetime_format(datetime2):
    # check if we load according to the format string
    # pass in an ambiguous date
    datetime_format = "%d-%m-%Y"
    actual = pd.Timestamp("Jan 2, 2011")

    logical_types = {
        "id": Integer,
        "time_format": (Datetime(datetime_format=datetime_format)),
        "time_no_format": Datetime,
    }

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="id",
        logical_types=logical_types,
        dataframe=datetime2,
    )

    col_format = es["test_dataframe"]["time_format"]
    col_no_format = es["test_dataframe"]["time_no_format"]
    # without the format string, pandas parses the ambiguous date incorrectly
    assert (col_no_format != actual).all()

    # with the format string, we correctly get Jan 2
    assert (col_format == actual).all()


def test_handles_datetime_mismatch():
    # can't convert arbitrary strings
    df = pd.DataFrame({"id": [0, 1, 2], "time": ["a", "b", "tomorrow"]})
    logical_types = {"id": Integer, "time": Datetime}

    error_text = "Time index column must contain datetime or numeric values"
    with pytest.raises(TypeError, match=error_text):
        es = EntitySet(id="test")
        es.add_dataframe(
            df,
            dataframe_name="test_dataframe",
            index="id",
            time_index="time",
            logical_types=logical_types,
        )


def test_dataframe_init(es):
    df = pd.DataFrame(
        {
            "id": ["0", "1", "2"],
            "time": [datetime(2011, 4, 9, 10, 31, 3 * i) for i in range(3)],
            "category": ["a", "b", "a"],
            "number": [4, 5, 6],
        },
    )
    logical_types = {"id": Categorical, "time": Datetime}
    es.add_dataframe(
        df.copy(),
        dataframe_name="test_dataframe",
        index="id",
        time_index="time",
        logical_types=logical_types,
    )
    df_shape = df.shape

    es_df_shape = es["test_dataframe"].shape
    assert es_df_shape == df_shape
    assert es["test_dataframe"].ww.index == "id"
    assert es["test_dataframe"].ww.time_index == "time"
    assert set(es["test_dataframe"].ww.columns) == set(df.columns)

    assert es["test_dataframe"]["time"].dtype == df["time"].dtype
    assert set(es["test_dataframe"]["id"]) == set(df["id"])


@pytest.fixture
def bad_df():
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], 3: ["a", "b", "c"]})


def test_nonstr_column_names(bad_df):
    es = EntitySet(id="Failure")
    error_text = r"All column names must be strings \(Columns \[3\] are not strings\)"
    with pytest.raises(ValueError, match=error_text):
        es.add_dataframe(dataframe_name="str_cols", dataframe=bad_df, index="a")

    bad_df.ww.init()
    with pytest.raises(ValueError, match=error_text):
        es.add_dataframe(dataframe_name="str_cols", dataframe=bad_df)


def test_sort_time_id():
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "transaction_time": pd.date_range(start="10:00", periods=6, freq="10s")[
                ::-1
            ],
        },
    )

    es = EntitySet(
        "test",
        dataframes={"t": (transactions_df.copy(), "id", "transaction_time")},
    )
    assert es["t"] is not transactions_df
    times = list(es["t"].transaction_time)
    assert times == sorted(transactions_df.transaction_time)


def test_already_sorted_parameter():
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "transaction_time": [
                datetime(2014, 4, 6),
                datetime(2012, 4, 8),
                datetime(2012, 4, 8),
                datetime(2013, 4, 8),
                datetime(2015, 4, 8),
                datetime(2016, 4, 9),
            ],
        },
    )

    es = EntitySet(id="test")
    es.add_dataframe(
        transactions_df.copy(),
        dataframe_name="t",
        index="id",
        time_index="transaction_time",
        already_sorted=True,
    )

    assert es["t"] is not transactions_df
    times = list(es["t"].transaction_time)
    assert times == list(transactions_df.transaction_time)


def test_concat_not_inplace(es):
    first_es = copy.deepcopy(es)
    for df in first_es.dataframes:
        new_df = df.loc[[], :]
        first_es.replace_dataframe(df.ww.name, new_df)

    second_es = copy.deepcopy(es)

    # access metadata so the cached data description is populated
    first_es.metadata

    new_es = first_es.concat(second_es)

    assert new_es == es
    assert new_es._data_description is None
    assert first_es._data_description is not None


def test_concat_inplace(es):
    first_es = copy.deepcopy(es)
    second_es = copy.deepcopy(es)
    for df in first_es.dataframes:
        new_df = df.loc[[], :]
        first_es.replace_dataframe(df.ww.name, new_df)

    # access metadata so the cached data description is populated
    es.metadata

    es.concat(first_es, inplace=True)

    assert second_es == es
    assert es._data_description is None


def test_concat_with_lti(es):
    first_es = copy.deepcopy(es)
    for df in first_es.dataframes:
        new_df = df.loc[[], :]
        first_es.replace_dataframe(df.ww.name, new_df)

    second_es = copy.deepcopy(es)

    first_es.add_last_time_indexes()
    second_es.add_last_time_indexes()
    es.add_last_time_indexes()

    new_es = first_es.concat(second_es)

    assert new_es == es

    first_es["stores"].ww.pop(LTI_COLUMN_NAME)
    first_es["stores"].ww.metadata.pop("last_time_index")
    second_es["stores"].ww.pop(LTI_COLUMN_NAME)
    second_es["stores"].ww.metadata.pop("last_time_index")

    assert not first_es.__eq__(es, deep=False)
    assert not second_es.__eq__(es, deep=False)
    assert LTI_COLUMN_NAME not in first_es["stores"]
    assert LTI_COLUMN_NAME not in second_es["stores"]

    new_es = first_es.concat(second_es)

    assert new_es.__eq__(es, deep=True)
    # stores will get last time index re-added because it has children that will get lti calculated
    assert LTI_COLUMN_NAME in new_es["stores"]


def test_concat_errors(es):
    # entitysets are not equal
    copy_es = copy.deepcopy(es)
    copy_es["customers"].ww.pop("phone_number")

    error = "Entitysets must have the same dataframes, relationships, and column names"
    with pytest.raises(ValueError, match=error):
        es.concat(copy_es)


def test_concat_sort_index_with_time_index(es):
    # only pandas dataframes sort on the index and time index
    es1 = copy.deepcopy(es)
    es1.replace_dataframe(
        dataframe_name="customers",
        df=es["customers"].loc[[0, 1], :],
        already_sorted=True,
    )
    es2 = copy.deepcopy(es)
    es2.replace_dataframe(
        dataframe_name="customers",
        df=es["customers"].loc[[2], :],
        already_sorted=True,
    )

    combined_es_order_1 = es1.concat(es2)
    combined_es_order_2 = es2.concat(es1)

    assert list(combined_es_order_1["customers"].index) == [2, 0, 1]
    assert list(combined_es_order_2["customers"].index) == [2, 0, 1]
    assert combined_es_order_1.__eq__(es, deep=True)
    assert combined_es_order_2.__eq__(es, deep=True)
    assert combined_es_order_2.__eq__(combined_es_order_1, deep=True)


def test_concat_sort_index_without_time_index(es):
    # Sorting is only performed on DataFrames with time indices
    es1 = copy.deepcopy(es)
    es1.replace_dataframe(
        dataframe_name="products",
        df=es["products"].iloc[[0, 1, 2], :],
        already_sorted=True,
    )
    es2 = copy.deepcopy(es)
    es2.replace_dataframe(
        dataframe_name="products",
        df=es["products"].iloc[[3, 4, 5], :],
        already_sorted=True,
    )

    combined_es_order_1 = es1.concat(es2)
    combined_es_order_2 = es2.concat(es1)

    # order matters when we don't sort
    assert list(combined_es_order_1["products"].index) == [
        "Haribo sugar-free gummy bears",
        "car",
        "toothpaste",
        "brown bag",
        "coke zero",
        "taco clock",
    ]
    assert list(combined_es_order_2["products"].index) == [
        "brown bag",
        "coke zero",
        "taco clock",
        "Haribo sugar-free gummy bears",
        "car",
        "toothpaste",
    ]
    assert combined_es_order_1.__eq__(es, deep=True)
    assert not combined_es_order_2.__eq__(es, deep=True)
    assert combined_es_order_2.__eq__(es, deep=False)
    assert not combined_es_order_2.__eq__(combined_es_order_1, deep=True)


def test_concat_with_make_index(es):
    df = pd.DataFrame({"id": [0, 1, 2], "category": ["a", "b", "a"]})
    logical_types = {"id": Categorical, "category": Categorical}
    es.add_dataframe(
        dataframe=df,
        dataframe_name="test_df",
        index="id1",
        make_index=True,
        logical_types=logical_types,
    )

    es_1 = copy.deepcopy(es)
    es_2 = copy.deepcopy(es)

    assert es.__eq__(es_1, deep=True)
    assert es.__eq__(es_2, deep=True)

    # map of what rows to take from es_1 and es_2 for each dataframe
    emap = {
        "log": [list(range(10)) + [14, 15, 16], list(range(10, 14)) + [15, 16]],
        "sessions": [[0, 1, 2], [1, 3, 4, 5]],
        "customers": [[0, 2], [1, 2]],
        "test_df": [[0, 1], [0, 2]],
    }

    for i, _es in enumerate([es_1, es_2]):
        for df_name, rows in emap.items():
            df = _es[df_name]
            _es.replace_dataframe(dataframe_name=df_name, df=df.loc[rows[i]])

    assert es.__eq__(es_1, deep=False)
    assert es.__eq__(es_2, deep=False)
    assert not es.__eq__(es_1, deep=True)
    assert not es.__eq__(es_2, deep=True)

    old_es_1 = copy.deepcopy(es_1)
    old_es_2 = copy.deepcopy(es_2)
    es_3 = es_1.concat(es_2)

    assert old_es_1.__eq__(es_1, deep=True)
    assert old_es_2.__eq__(es_2, deep=True)

    assert es_3.__eq__(es, deep=True)


@pytest.fixture
def transactions_df():
    return pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "card_id": [1, 2, 1, 3, 4, 5],
            "transaction_time": [10, 12, 13, 20, 21, 20],
            "fraud": [True, False, False, False, True, True],
        },
    )


def test_set_time_type_on_init(transactions_df):
    # create cards dataframe
    cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
    cards_logical_types = None
    transactions_logical_types = None
    dataframes = {
        "cards": (cards_df, "id", None, cards_logical_types),
        "transactions": (
            transactions_df,
            "id",
            "transaction_time",
            transactions_logical_types,
        ),
    }
    relationships = [("cards", "id", "transactions", "card_id")]
    es = EntitySet("fraud", dataframes, relationships)
    # assert time_type is set
    assert es.time_type == "numeric"


def test_sets_time_when_adding_dataframe(transactions_df):
    accounts_df = pd.DataFrame(
        {
            "id": [3, 4, 5],
            "signup_date": [
                datetime(2002, 5, 1),
                datetime(2006, 3, 20),
                datetime(2011, 11, 11),
            ],
        },
    )
    accounts_df_string = pd.DataFrame(
        {"id": [3, 4, 5], "signup_date": ["element", "exporting", "editable"]},
    )
    accounts_logical_types = None
    transactions_logical_types = None

    # create empty entityset
    es = EntitySet("fraud")
    # assert it's not set
    assert getattr(es, "time_type", None) is None
    # add dataframe
    es.add_dataframe(
        transactions_df,
        dataframe_name="transactions",
        index="id",
        time_index="transaction_time",
        logical_types=transactions_logical_types,
    )
    # assert time_type is set
    assert es.time_type == "numeric"
    # add another dataframe
    es.normalize_dataframe("transactions", "cards", "card_id", make_time_index=True)
    # assert time_type unchanged
    assert es.time_type == "numeric"
    # add wrong time type dataframe
    error_text = "accounts time index is Datetime type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error_text):
        es.add_dataframe(
            accounts_df,
            dataframe_name="accounts",
            index="id",
            time_index="signup_date",
            logical_types=accounts_logical_types,
        )

    error_text = "Time index column must contain datetime or numeric values"
    with pytest.raises(TypeError, match=error_text):
        es.add_dataframe(
            accounts_df_string,
            dataframe_name="accounts",
            index="id",
            time_index="signup_date",
        )


def test_secondary_time_index_no_primary_time_index(es):
    es["products"].ww.set_types(logical_types={"rating": "Datetime"})
    assert es["products"].ww.time_index is None

    error = (
        "Cannot set secondary time index on a DataFrame that has no primary time index."
    )
    with pytest.raises(ValueError, match=error):
        es.set_secondary_time_index("products", {"rating": ["url"]})

    assert "secondary_time_index" not in es["products"].ww.metadata
    assert es["products"].ww.time_index is None


def test_set_non_valid_time_index_type(es):
    error_text = "Time index column must be a Datetime or numeric column."
    with pytest.raises(TypeError, match=error_text):
        es["log"].ww.set_time_index("purchased")


def test_checks_time_type_setting_secondary_time_index(es):
    # entityset is timestamp time type
    assert es.time_type == Datetime
    # add secondary index that is timestamp type
    new_2nd_ti = {
        "upgrade_date": ["upgrade_date", "favorite_quote"],
        "cancel_date": ["cancel_date", "cancel_reason"],
    }
    es.set_secondary_time_index("customers", new_2nd_ti)
    assert es.time_type == Datetime
    # add secondary index that is numeric type
    new_2nd_ti = {"age": ["age", "loves_ice_cream"]}

    error_text = "customers time index is numeric type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error_text):
        es.set_secondary_time_index("customers", new_2nd_ti)
    # add secondary index that is non-time type
    new_2nd_ti = {"favorite_quote": ["favorite_quote", "loves_ice_cream"]}

    error_text = "customers time index not recognized as numeric or datetime"
    with pytest.raises(TypeError, match=error_text):
        es.set_secondary_time_index("customers", new_2nd_ti)
    # add mismatched pair of secondary time indexes
    new_2nd_ti = {
        "upgrade_date": ["upgrade_date", "favorite_quote"],
        "age": ["age", "loves_ice_cream"],
    }

    error_text = "customers time index is numeric type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error_text):
        es.set_secondary_time_index("customers", new_2nd_ti)

    # create entityset with numeric time type
    cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "card_id": [1, 2, 1, 3, 4, 5],
            "transaction_time": [10, 12, 13, 20, 21, 20],
            "fraud_decision_time": [11, 14, 15, 21, 22, 21],
            "transaction_city": ["City A"] * 6,
            "transaction_date": [datetime(1989, 2, i) for i in range(1, 7)],
            "fraud": [True, False, False, False, True, True],
        },
    )
    dataframes = {
        "cards": (cards_df, "id"),
        "transactions": (transactions_df, "id", "transaction_time"),
    }
    relationships = [("cards", "id", "transactions", "card_id")]
    card_es = EntitySet("fraud", dataframes, relationships)
    assert card_es.time_type == "numeric"
    # add secondary index that is numeric time type
    new_2nd_ti = {"fraud_decision_time": ["fraud_decision_time", "fraud"]}
    card_es.set_secondary_time_index("transactions", new_2nd_ti)
    assert card_es.time_type == "numeric"
    # add secondary index that is timestamp type
    new_2nd_ti = {"transaction_date": ["transaction_date", "fraud"]}

    error_text = "transactions time index is Datetime type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error_text):
        card_es.set_secondary_time_index("transactions", new_2nd_ti)
    # add secondary index that is non-time type
    new_2nd_ti = {"transaction_city": ["transaction_city", "fraud"]}

    error_text = "transactions time index not recognized as numeric or datetime"
    with pytest.raises(TypeError, match=error_text):
        card_es.set_secondary_time_index("transactions", new_2nd_ti)
    # add mixed secondary time indexes
    new_2nd_ti = {
        "transaction_city": ["transaction_city", "fraud"],
        "fraud_decision_time": ["fraud_decision_time", "fraud"],
    }
    with pytest.raises(TypeError, match=error_text):
        card_es.set_secondary_time_index("transactions", new_2nd_ti)

    # add bool secondary time index
    error_text = "transactions time index not recognized as numeric or datetime"
    with pytest.raises(TypeError, match=error_text):
        card_es.set_secondary_time_index("transactions", {"fraud": ["fraud"]})


def test_normalize_dataframe(es):
    error_text = "'additional_columns' must be a list, but received type.*"
    with pytest.raises(TypeError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            additional_columns="log",
        )

    error_text = "'copy_columns' must be a list, but received type.*"
    with pytest.raises(TypeError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            copy_columns="log",
        )

    es.normalize_dataframe(
        "sessions",
        "device_types",
        "device_type",
        additional_columns=["device_name"],
        make_time_index=False,
    )

    assert len(es.get_forward_relationships("sessions")) == 2
    assert (
        es.get_forward_relationships("sessions")[1].parent_dataframe.ww.name
        == "device_types"
    )
    assert "device_name" in es["device_types"].columns
    assert "device_name" not in es["sessions"].columns
    assert "device_type" in es["device_types"].columns


def test_normalize_dataframe_add_index_as_column(es):
    error_text = "Not adding device_type as both index and column in additional_columns"
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            additional_columns=["device_name", "device_type"],
            make_time_index=False,
        )

    error_text = "Not adding device_type as both index and column in copy_columns"
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            copy_columns=["device_name", "device_type"],
            make_time_index=False,
        )


def test_normalize_dataframe_new_time_index_in_base_dataframe_error_check(es):
    error_text = "'make_time_index' must be a column in the base dataframe"
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            base_dataframe_name="customers",
            new_dataframe_name="cancellations",
            index="cancel_reason",
            make_time_index="non-existent",
        )


def test_normalize_dataframe_new_time_index_in_column_list_error_check(es):
    error_text = (
        "'make_time_index' must be specified in 'additional_columns' or 'copy_columns'"
    )
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            base_dataframe_name="customers",
            new_dataframe_name="cancellations",
            index="cancel_reason",
            make_time_index="cancel_date",
        )


def test_normalize_dataframe_new_time_index_copy_success_check(es):
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancellations",
        index="cancel_reason",
        make_time_index="cancel_date",
        additional_columns=[],
        copy_columns=["cancel_date"],
    )


def test_normalize_dataframe_new_time_index_additional_success_check(es):
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancellations",
        index="cancel_reason",
        make_time_index="cancel_date",
        additional_columns=["cancel_date"],
        copy_columns=[],
    )


@pytest.fixture
def normalize_es():
    df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3],
            "A": [5, 4, 2, 3],
            "time": [
                datetime(2020, 6, 3),
                datetime(2020, 3, 12),
                datetime(2020, 5, 1),
                datetime(2020, 4, 22),
            ],
        },
    )
    es = EntitySet("es")
    return es.add_dataframe(dataframe_name="data", dataframe=df, index="id")


def test_normalize_time_index_from_none(normalize_es):
    assert normalize_es["data"].ww.time_index is None

    normalize_es.normalize_dataframe(
        base_dataframe_name="data",
        new_dataframe_name="normalized",
        index="A",
        make_time_index="time",
        copy_columns=["time"],
    )
    assert normalize_es["normalized"].ww.time_index == "time"
    df = normalize_es["normalized"]

    assert df["time"].is_monotonic_increasing


def test_raise_error_if_duplicate_additional_columns_passed(es):
    error_text = (
        "'additional_columns' contains duplicate columns. All columns must be unique."
    )
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            additional_columns=["device_name", "device_name"],
        )


def test_raise_error_if_duplicate_copy_columns_passed(es):
    error_text = (
        "'copy_columns' contains duplicate columns. All columns must be unique."
    )
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            "sessions",
            "device_types",
            "device_type",
            copy_columns=["device_name", "device_name"],
        )


def test_normalize_dataframe_copies_logical_types(es):
    es["log"].ww.set_types(
        logical_types={
            "value": Ordinal(
                order=[0.0, 1.0, 2.0, 3.0, 5.0, 7.0, 10.0, 14.0, 15.0, 20.0],
            ),
        },
    )

    assert isinstance(es["log"].ww.logical_types["value"], Ordinal)
    assert len(es["log"].ww.logical_types["value"].order) == 10
    assert isinstance(es["log"].ww.logical_types["priority_level"], Ordinal)
    assert len(es["log"].ww.logical_types["priority_level"].order) == 3
    es.normalize_dataframe(
        "log",
        "values_2",
        "value_2",
        additional_columns=["priority_level"],
        copy_columns=["value"],
        make_time_index=False,
    )

    assert len(es.get_forward_relationships("log")) == 3
    assert es.get_forward_relationships("log")[2].parent_dataframe.ww.name == "values_2"
    assert "priority_level" in es["values_2"].columns
    assert "value" in es["values_2"].columns
    assert "priority_level" not in es["log"].columns
    assert "value" in es["log"].columns
    assert "value_2" in es["values_2"].columns
    assert isinstance(es["values_2"].ww.logical_types["priority_level"], Ordinal)
    assert len(es["values_2"].ww.logical_types["priority_level"].order) == 3
    assert isinstance(es["values_2"].ww.logical_types["value"], Ordinal)
    assert len(es["values_2"].ww.logical_types["value"].order) == 10


def test_make_time_index_keeps_original_sorting():
    trips = {
        "trip_id": [999 - i for i in range(1000)],
        "flight_time": [datetime(1997, 4, 1)] * 1000,
        "flight_id": [1] * 350 + [2] * 650,
    }
    order = list(range(1000))
    df = pd.DataFrame.from_dict(trips)
    es = EntitySet("flights")
    es.add_dataframe(
        dataframe=df,
        dataframe_name="trips",
        index="trip_id",
        time_index="flight_time",
    )
    assert (es["trips"]["trip_id"] == order).all()
    es.normalize_dataframe(
        base_dataframe_name="trips",
        new_dataframe_name="flights",
        index="flight_id",
        make_time_index=True,
    )
    assert (es["trips"]["trip_id"] == order).all()


def test_normalize_dataframe_new_time_index(es):
    new_time_index = "value_time"
    es.normalize_dataframe(
        "log",
        "values",
        "value",
        make_time_index=True,
        new_dataframe_time_index=new_time_index,
    )

    assert es["values"].ww.time_index == new_time_index
    assert new_time_index in es["values"].columns
    assert len(es["values"].columns) == 2
    df = es["values"]
    assert df[new_time_index].is_monotonic_increasing


def test_normalize_dataframe_same_index(es):
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3],
            "transaction_time": pd.date_range(start="10:00", periods=3, freq="10s"),
            "first_df_time": [1, 2, 3],
        },
    )
    es = EntitySet("example")
    es.add_dataframe(
        dataframe_name="df",
        index="id",
        time_index="transaction_time",
        dataframe=transactions_df,
    )

    error_text = "'index' must be different from the index column of the base dataframe"
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            base_dataframe_name="df",
            new_dataframe_name="new_dataframe",
            index="id",
            make_time_index=True,
        )


def test_secondary_time_index(es):
    es.normalize_dataframe(
        "log",
        "values",
        "value",
        make_time_index=True,
        make_secondary_time_index={"datetime": ["comments"]},
        new_dataframe_time_index="value_time",
        new_dataframe_secondary_time_index="second_ti",
    )

    assert isinstance(es["values"].ww.logical_types["second_ti"], Datetime)
    assert es["values"].ww.semantic_tags["second_ti"] == set()
    assert es["values"].ww.metadata["secondary_time_index"] == {
        "second_ti": ["comments", "second_ti"],
    }


def test_sizeof(es):
    es.add_last_time_indexes()
    total_size = 0
    for df in es.dataframes:
        total_size += df.__sizeof__()

    assert es.__sizeof__() == total_size


def test_construct_without_id():
    assert EntitySet().id is None


def test_repr_without_id():
    match = "Entityset: None\n  DataFrames:\n  Relationships:\n    No relationships"
    assert repr(EntitySet()) == match


def test_getitem_without_id():
    error_text = "DataFrame test does not exist in entity set"
    with pytest.raises(KeyError, match=error_text):
        EntitySet()["test"]


def test_metadata_without_id():
    es = EntitySet()
    assert es.metadata.id is None


@pytest.fixture
def datetime3():
    return pd.DataFrame({"id": [0, 1, 2], "ints": ["1", "2", "1"]})


def test_datetime64_conversion(datetime3):
    df = datetime3
    df["time"] = pd.Timestamp.now()
    df["time"] = df["time"].dt.tz_localize("UTC")

    es = EntitySet(id="test")
    es.add_dataframe(
        dataframe_name="test_dataframe",
        index="id",
        dataframe=df,
        logical_types=None,
    )
    es["test_dataframe"].ww.set_time_index("time")
    assert es["test_dataframe"].ww.time_index == "time"


@pytest.fixture
def index_df():
    return pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "transaction_time": pd.date_range(start="10:00", periods=6, freq="10s"),
            "first_dataframe_time": [1, 2, 3, 5, 6, 6],
        },
    )


def test_same_index_values(index_df):
    es = EntitySet("example")

    error_text = (
        '"id" is already set as the index. An index cannot also be the time index.'
    )
    with pytest.raises(ValueError, match=error_text):
        es.add_dataframe(
            dataframe_name="dataframe",
            index="id",
            time_index="id",
            dataframe=index_df,
            logical_types=None,
        )

    es.add_dataframe(
        dataframe_name="dataframe",
        index="id",
        time_index="transaction_time",
        dataframe=index_df,
        logical_types=None,
    )

    error_text = "time_index and index cannot be the same value, first_dataframe_time"
    with pytest.raises(ValueError, match=error_text):
        es.normalize_dataframe(
            base_dataframe_name="dataframe",
            new_dataframe_name="new_dataframe",
            index="first_dataframe_time",
            make_time_index=True,
        )


def test_use_time_index(index_df):
    bad_ltypes = {"transaction_time": Datetime}
    bad_semantic_tags = {"transaction_time": "time_index"}
    logical_types = None

    es = EntitySet()

    error_text = re.escape(
        "Cannot add 'time_index' tag directly for column transaction_time. To set a column as the time index, use DataFrame.ww.set_time_index() instead.",
    )
    with pytest.raises(ValueError, match=error_text):
        es.add_dataframe(
            dataframe_name="dataframe",
            index="id",
            logical_types=bad_ltypes,
            semantic_tags=bad_semantic_tags,
            dataframe=index_df,
        )

    es.add_dataframe(
        dataframe_name="dataframe",
        index="id",
        time_index="transaction_time",
        logical_types=logical_types,
        dataframe=index_df,
    )


def test_normalize_with_datetime_time_index(es):
    es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancel_reason",
        index="cancel_reason",
        make_time_index=False,
        copy_columns=["signup_date", "upgrade_date"],
    )

    assert isinstance(es["cancel_reason"].ww.logical_types["signup_date"], Datetime)
    assert isinstance(es["cancel_reason"].ww.logical_types["upgrade_date"], Datetime)


def test_normalize_with_numeric_time_index(int_es):
    int_es.normalize_dataframe(
        base_dataframe_name="customers",
        new_dataframe_name="cancel_reason",
        index="cancel_reason",
        make_time_index=False,
        copy_columns=["signup_date", "upgrade_date"],
    )

    assert int_es["cancel_reason"].ww.semantic_tags["signup_date"] == {"numeric"}


def test_normalize_with_invalid_time_index(es):
    error_text = "Time index column must contain datetime or numeric values"
    with pytest.raises(TypeError, match=error_text):
        es.normalize_dataframe(
            base_dataframe_name="customers",
            new_dataframe_name="cancel_reason",
            index="cancel_reason",
            copy_columns=["upgrade_date", "favorite_quote"],
            make_time_index="favorite_quote",
        )


def test_entityset_init():
    cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "card_id": [1, 2, 1, 3, 4, 5],
            "transaction_time": [10, 12, 13, 20, 21, 20],
            "upgrade_date": [51, 23, 45, 12, 22, 53],
            "fraud": [True, False, False, False, True, True],
        },
    )
    logical_types = {"fraud": "boolean", "card_id": "integer"}
    dataframes = {
        "cards": (cards_df.copy(), "id", None, {"id": "Integer"}),
        "transactions": (
            transactions_df.copy(),
            "id",
            "transaction_time",
            logical_types,
            None,
            False,
        ),
    }
    relationships = [("cards", "id", "transactions", "card_id")]
    es = EntitySet(id="fraud_data", dataframes=dataframes, relationships=relationships)
    assert es["transactions"].ww.index == "id"
    assert es["transactions"].ww.time_index == "transaction_time"
    es_copy = EntitySet(id="fraud_data")
    es_copy.add_dataframe(dataframe_name="cards", dataframe=cards_df.copy(), index="id")
    es_copy.add_dataframe(
        dataframe_name="transactions",
        dataframe=transactions_df.copy(),
        index="id",
        logical_types=logical_types,
        make_index=False,
        time_index="transaction_time",
    )
    es_copy.add_relationship("cards", "id", "transactions", "card_id")

    assert es["cards"].ww == es_copy["cards"].ww
    assert es["transactions"].ww == es_copy["transactions"].ww


def test_add_interesting_values_specified_vals(es):
    product_vals = ["coke zero", "taco clock"]
    country_vals = ["AL", "US"]
    interesting_values = {
        "product_id": product_vals,
        "countrycode": country_vals,
    }
    es.add_interesting_values(dataframe_name="log", values=interesting_values)

    assert es["log"].ww["product_id"].ww.metadata["interesting_values"] == product_vals
    assert es["log"].ww["countrycode"].ww.metadata["interesting_values"] == country_vals


def test_add_interesting_values_vals_specified_without_dataframe_name(es):
    interesting_values = {
        "countrycode": ["AL", "US"],
    }
    error_msg = "dataframe_name must be specified if values are provided"
    with pytest.raises(ValueError, match=error_msg):
        es.add_interesting_values(values=interesting_values)


def test_add_interesting_values_single_dataframe(es):
    es.add_interesting_values(dataframe_name="log")

    expected_vals = {
        "zipcode": ["02116", "02116-3899", "12345-6789", "1234567890", "0"],
        "countrycode": ["US", "AL", "ALB", "USA"],
        "subregioncode": ["US-AZ", "US-MT", "ZM-06", "UG-219"],
        "priority_level": [0, 1, 2],
    }

    for col in es["log"].columns:
        if col in expected_vals:
            assert (
                es["log"].ww.columns[col].metadata.get("interesting_values")
                == expected_vals[col]
            )
        else:
            assert es["log"].ww.columns[col].metadata.get("interesting_values") is None


def test_add_interesting_values_multiple_dataframes(es):
    es.add_interesting_values()
    expected_cols_with_vals = {
        "régions": {"language"},
        "stores": set(),
        "products": {"department"},
        "customers": {"cancel_reason", "engagement_level"},
        "sessions": {"device_type", "device_name"},
        "log": {"zipcode", "countrycode", "subregioncode", "priority_level"},
        "cohorts": {"cohort_name"},
    }
    for df_id, df in es.dataframe_dict.items():
        expected_cols = expected_cols_with_vals[df_id]
        for col in df.columns:
            if col in expected_cols:
                assert df.ww.columns[col].metadata.get("interesting_values") is not None
            else:
                assert df.ww.columns[col].metadata.get("interesting_values") is None


def test_add_interesting_values_verbose_output(caplog):
    es = load_retail(nrows=200)
    es["order_products"].ww.set_types({"quantity": "Categorical"})
    es["orders"].ww.set_types({"country": "Categorical"})
    logger = logging.getLogger("featuretools")
    logger.propagate = True
    logger_es = logging.getLogger("featuretools.entityset")
    logger_es.propagate = True
    es.add_interesting_values(verbose=True, max_values=10)
    logger.propagate = False
    logger_es.propagate = False
    assert (
        "Column country: Marking United Kingdom as an interesting value" in caplog.text
    )
    assert "Column quantity: Marking 6 as an interesting value" in caplog.text


def test_entityset_equality(es):
    first_es = EntitySet()
    second_es = EntitySet()
    assert first_es == second_es

    first_es.add_dataframe(
        dataframe_name="customers",
        dataframe=es["customers"].copy(),
        index="id",
        time_index="signup_date",
        logical_types=es["customers"].ww.logical_types,
        semantic_tags=get_df_tags(es["customers"]),
    )
    assert first_es != second_es

    second_es.add_dataframe(
        dataframe_name="sessions",
        dataframe=es["sessions"].copy(),
        index="id",
        logical_types=es["sessions"].ww.logical_types,
        semantic_tags=get_df_tags(es["sessions"]),
    )
    assert first_es != second_es

    first_es.add_dataframe(
        dataframe_name="sessions",
        dataframe=es["sessions"].copy(),
        index="id",
        logical_types=es["sessions"].ww.logical_types,
        semantic_tags=get_df_tags(es["sessions"]),
    )
    second_es.add_dataframe(
        dataframe_name="customers",
        dataframe=es["customers"].copy(),
        index="id",
        time_index="signup_date",
        logical_types=es["customers"].ww.logical_types,
        semantic_tags=get_df_tags(es["customers"]),
    )
    assert first_es == second_es

    first_es.add_relationship("customers", "id", "sessions", "customer_id")
    assert first_es != second_es
    assert second_es != first_es

    second_es.add_relationship("customers", "id", "sessions", "customer_id")
    assert first_es == second_es


def test_entityset_dataframe_dict_and_relationship_equality(es):
    first_es = EntitySet()
    second_es = EntitySet()

    first_es.add_dataframe(
        dataframe_name="sessions",
        dataframe=es["sessions"].copy(),
        index="id",
        logical_types=es["sessions"].ww.logical_types,
        semantic_tags=get_df_tags(es["sessions"]),
    )

    # Tests if two entity sets are not equal if they have a different
    # number of dataframes attached.
    # first_es has 1 dataframe, second_es has 0 dataframes attached.
    assert first_es != second_es

    second_es.add_dataframe(
        dataframe_name="customers",
        dataframe=es["customers"].copy(),
        index="id",
        logical_types=es["customers"].ww.logical_types,
        semantic_tags=get_df_tags(es["customers"]),
    )

    # Tests if two entity sets are not equal if they have different
    # dataframes attached.
    # first_es has the sessions dataframe attached,
    # second_es has the customers dataframe attached.
    assert first_es != second_es

    first_es.add_dataframe(
        dataframe_name="customers",
        dataframe=es["customers"].copy(),
        index="id",
        logical_types=es["customers"].ww.logical_types,
        semantic_tags=get_df_tags(es["customers"]),
    )
    first_es.add_dataframe(
        dataframe_name="stores",
        dataframe=es["stores"].copy(),
        index="id",
        logical_types=es["stores"].ww.logical_types,
        semantic_tags=get_df_tags(es["stores"]),
    )
    first_es.add_dataframe(
        dataframe_name="régions",
        dataframe=es["régions"].copy(),
        index="id",
        logical_types=es["régions"].ww.logical_types,
        semantic_tags=get_df_tags(es["régions"]),
    )

    second_es.add_dataframe(
        dataframe_name="sessions",
        dataframe=es["sessions"].copy(),
        index="id",
        logical_types=es["sessions"].ww.logical_types,
        semantic_tags=get_df_tags(es["sessions"]),
    )
    second_es.add_dataframe(
        dataframe_name="stores",
        dataframe=es["stores"].copy(),
        index="id",
        logical_types=es["stores"].ww.logical_types,
        semantic_tags=get_df_tags(es["stores"]),
    )
    second_es.add_dataframe(
        dataframe_name="régions",
        dataframe=es["régions"].copy(),
        index="id",
        logical_types=es["régions"].ww.logical_types,
        semantic_tags=get_df_tags(es["régions"]),
    )

    # Now the two entity sets should be equal,
    # since they have the same dataframes.
    assert first_es == second_es

    first_es.add_relationship("customers", "id", "sessions", "customer_id")
    second_es.add_relationship("régions", "id", "stores", "région_id")

    # Test if two entity sets are not equal
    # if they have different relationships.
    assert first_es != second_es


def test_entityset_id_equality():
    first_es = EntitySet(id="first")
    first_es_copy = EntitySet(id="first")
    second_es = EntitySet(id="second")

    assert first_es != second_es
    assert first_es == first_es_copy


def test_entityset_time_type_equality():
    first_es = EntitySet()
    second_es = EntitySet()
    assert first_es == second_es

    first_es.time_type = "numeric"
    assert first_es != second_es

    second_es.time_type = Datetime
    assert first_es != second_es

    second_es.time_type = "numeric"
    assert first_es == second_es


def test_entityset_deep_equality(es):
    first_es = EntitySet()
    second_es = EntitySet()

    first_es.add_dataframe(
        dataframe_name="customers",
        dataframe=es["customers"].copy(),
        index="id",
        time_index="signup_date",
        logical_types=es["customers"].ww.logical_types,
        semantic_tags=get_df_tags(es["customers"]),
    )
    first_es.add_dataframe(
        dataframe_name="sessions",
        dataframe=es["sessions"].copy(),
        index="id",
        logical_types=es["sessions"].ww.logical_types,
        semantic_tags=get_df_tags(es["sessions"]),
    )

    second_es.add_dataframe(
        dataframe_name="sessions",
        dataframe=es["sessions"].copy(),
        index="id",
        logical_types=es["sessions"].ww.logical_types,
        semantic_tags=get_df_tags(es["sessions"]),
    )
    second_es.add_dataframe(
        dataframe_name="customers",
        dataframe=es["customers"].copy(),
        index="id",
        time_index="signup_date",
        logical_types=es["customers"].ww.logical_types,
        semantic_tags=get_df_tags(es["customers"]),
    )

    assert first_es.__eq__(second_es, deep=False)
    assert first_es.__eq__(second_es, deep=True)

    # Woodwork metadata only gets included in deep equality check
    first_es["sessions"].ww.metadata["created_by"] = "user0"

    assert first_es.__eq__(second_es, deep=False)
    assert not first_es.__eq__(second_es, deep=True)

    second_es["sessions"].ww.metadata["created_by"] = "user0"

    assert first_es.__eq__(second_es, deep=False)
    assert first_es.__eq__(second_es, deep=True)

    updated_df = first_es["customers"].loc[[2, 0], :]
    first_es.replace_dataframe("customers", updated_df)

    assert first_es.__eq__(second_es, deep=False)
    assert not first_es.__eq__(second_es, deep=True)


def test_deepcopy_entityset(make_es):
    # Uses make_es since the es fixture uses deepcopy
    copied_es = copy.deepcopy(make_es)

    assert copied_es == make_es
    assert copied_es is not make_es

    for df_name in make_es.dataframe_dict.keys():
        original_df = make_es[df_name]
        new_df = copied_es[df_name]

        assert new_df.ww.schema == original_df.ww.schema
        assert new_df.ww._schema is not original_df.ww._schema

        pd.testing.assert_frame_equal(new_df, original_df)
        assert new_df is not original_df


def test_deepcopy_entityset_woodwork_changes(es):
    copied_es = copy.deepcopy(es)

    assert copied_es == es
    assert copied_es is not es

    copied_es["products"].ww.add_semantic_tags({"id": "new_tag"})

    assert copied_es["products"].ww.semantic_tags["id"] == {"index", "new_tag"}
    assert es["products"].ww.semantic_tags["id"] == {"index"}
    assert copied_es != es


def test_deepcopy_entityset_featuretools_changes(es):
    copied_es = copy.deepcopy(es)

    assert copied_es == es
    assert copied_es is not es

    copied_es.set_secondary_time_index(
        "customers",
        {"upgrade_date": ["engagement_level"]},
    )
    assert copied_es["customers"].ww.metadata["secondary_time_index"] == {
        "upgrade_date": ["engagement_level", "upgrade_date"],
    }
    assert es["customers"].ww.metadata["secondary_time_index"] == {
        "cancel_date": ["cancel_reason", "cancel_date"],
    }


def test_es__getstate__key_unique(es):
    assert not hasattr(es, WW_SCHEMA_KEY)


def test_es_pickling(es):
    pkl = pickle.dumps(es)
    unpickled = pickle.loads(pkl)

    assert es.__eq__(unpickled, deep=True)
    assert not hasattr(unpickled, WW_SCHEMA_KEY)


def test_empty_es_pickling():
    es = EntitySet(id="empty")
    pkl = pickle.dumps(es)
    unpickled = pickle.loads(pkl)

    assert es.__eq__(unpickled, deep=True)


@patch("featuretools.entityset.entityset.EntitySet.add_dataframe")
def test_setitem(add_dataframe):
    es = EntitySet()
    df = pd.DataFrame()
    es["new_df"] = df
    assert add_dataframe.called
    add_dataframe.assert_called_with(dataframe=df, dataframe_name="new_df")


def test_latlong_nan_normalization(latlong_df):
    latlong_df.ww.init(
        name="latLong",
        index="idx",
        logical_types={"latLong": "LatLong"},
    )

    dataframes = {"latLong": (latlong_df,)}

    relationships = []

    es = EntitySet("latlong-test", dataframes, relationships)

    normalized_df = es["latLong"]

    expected_df = pd.DataFrame(
        {"idx": [0, 1, 2], "latLong": [(np.nan, np.nan), (1, 2), (np.nan, np.nan)]},
    )

    pd.testing.assert_frame_equal(normalized_df, expected_df)


def test_latlong_nan_normalization_add_dataframe(latlong_df):
    latlong_df.ww.init(
        name="latLong",
        index="idx",
        logical_types={"latLong": "LatLong"},
    )

    es = EntitySet("latlong-test")

    es.add_dataframe(latlong_df)

    normalized_df = es["latLong"]

    expected_df = pd.DataFrame(
        {"idx": [0, 1, 2], "latLong": [(np.nan, np.nan), (1, 2), (np.nan, np.nan)]},
    )

    pd.testing.assert_frame_equal(normalized_df, expected_df)


================================================
FILE: featuretools/tests/entityset_tests/test_es_metadata.py
================================================
import pandas as pd
import pytest

from featuretools import EntitySet
from featuretools.tests.testing_utils import backward_path, forward_path


def test_cannot_re_add_relationship_that_already_exists(es):
    before_len = len(es.relationships)
    es.add_relationship(relationship=es.relationships[0])
    after_len = len(es.relationships)
    assert before_len == after_len


def test_add_relationships_convert_type(es):
    for r in es.relationships:
        assert r.parent_dataframe.ww.index == r._parent_column_name
        assert "foreign_key" in r.child_column.ww.semantic_tags
        assert r.child_column.ww.logical_type == r.parent_column.ww.logical_type


def test_get_forward_dataframes(es):
    dataframes = es.get_forward_dataframes("log")
    path_to_sessions = forward_path(es, ["log", "sessions"])
    path_to_products = forward_path(es, ["log", "products"])
    assert list(dataframes) == [
        ("sessions", path_to_sessions),
        ("products", path_to_products),
    ]


def test_get_backward_dataframes(es):
    dataframes = es.get_backward_dataframes("customers")
    path_to_sessions = backward_path(es, ["customers", "sessions"])
    assert list(dataframes) == [("sessions", path_to_sessions)]


def test_get_forward_dataframes_deep(es):
    dataframes = es.get_forward_dataframes("log", deep=True)
    path_to_sessions = forward_path(es, ["log", "sessions"])
    path_to_products = forward_path(es, ["log", "products"])
    path_to_customers = forward_path(es, ["log", "sessions", "customers"])
    path_to_regions = forward_path(es, ["log", "sessions", "customers", "régions"])
    path_to_cohorts = forward_path(es, ["log", "sessions", "customers", "cohorts"])
    assert list(dataframes) == [
        ("sessions", path_to_sessions),
        ("customers", path_to_customers),
        ("cohorts", path_to_cohorts),
        ("régions", path_to_regions),
        ("products", path_to_products),
    ]


def test_get_backward_dataframes_deep(es):
    dataframes = es.get_backward_dataframes("customers", deep=True)
    path_to_log = backward_path(es, ["customers", "sessions", "log"])
    path_to_sessions = backward_path(es, ["customers", "sessions"])
    assert list(dataframes) == [("sessions", path_to_sessions), ("log", path_to_log)]


def test_get_forward_relationships(es):
    relationships = es.get_forward_relationships("log")
    assert len(relationships) == 2
    assert relationships[0]._parent_dataframe_name == "sessions"
    assert relationships[0]._child_dataframe_name == "log"
    assert relationships[1]._parent_dataframe_name == "products"
    assert relationships[1]._child_dataframe_name == "log"

    relationships = es.get_forward_relationships("sessions")
    assert len(relationships) == 1
    assert relationships[0]._parent_dataframe_name == "customers"
    assert relationships[0]._child_dataframe_name == "sessions"


def test_get_backward_relationships(es):
    relationships = es.get_backward_relationships("sessions")
    assert len(relationships) == 1
    assert relationships[0]._parent_dataframe_name == "sessions"
    assert relationships[0]._child_dataframe_name == "log"

    relationships = es.get_backward_relationships("customers")
    assert len(relationships) == 1
    assert relationships[0]._parent_dataframe_name == "customers"
    assert relationships[0]._child_dataframe_name == "sessions"


def test_find_forward_paths(es):
    paths = list(es.find_forward_paths("log", "customers"))
    assert len(paths) == 1

    path = paths[0]

    assert len(path) == 2
    assert path[0]._child_dataframe_name == "log"
    assert path[0]._parent_dataframe_name == "sessions"
    assert path[1]._child_dataframe_name == "sessions"
    assert path[1]._parent_dataframe_name == "customers"


def test_find_forward_paths_multiple_paths(diamond_es):
    paths = list(diamond_es.find_forward_paths("transactions", "regions"))
    assert len(paths) == 2

    path1, path2 = paths

    r1, r2 = path1
    assert r1._child_dataframe_name == "transactions"
    assert r1._parent_dataframe_name == "stores"
    assert r2._child_dataframe_name == "stores"
    assert r2._parent_dataframe_name == "regions"

    r1, r2 = path2
    assert r1._child_dataframe_name == "transactions"
    assert r1._parent_dataframe_name == "customers"
    assert r2._child_dataframe_name == "customers"
    assert r2._parent_dataframe_name == "regions"


def test_find_forward_paths_multiple_relationships(games_es):
    paths = list(games_es.find_forward_paths("games", "teams"))
    assert len(paths) == 2

    path1, path2 = paths
    assert len(path1) == 1
    assert len(path2) == 1
    r1 = path1[0]
    r2 = path2[0]

    assert r1._child_dataframe_name == "games"
    assert r2._child_dataframe_name == "games"
    assert r1._parent_dataframe_name == "teams"
    assert r2._parent_dataframe_name == "teams"

    assert r1._child_column_name == "home_team_id"
    assert r2._child_column_name == "away_team_id"
    assert r1._parent_column_name == "id"
    assert r2._parent_column_name == "id"


@pytest.fixture
def employee_df():
    return pd.DataFrame({"id": [0], "manager_id": [0]})


def test_find_forward_paths_ignores_loops(employee_df):
    dataframes = {"employees": (employee_df, "id")}
    relationships = [("employees", "id", "employees", "manager_id")]
    es = EntitySet(dataframes=dataframes, relationships=relationships)

    paths = list(es.find_forward_paths("employees", "employees"))
    assert len(paths) == 1
    assert paths[0] == []
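

# Illustrative sketch only, not the EntitySet implementation: a standalone
# depth-first enumeration of forward (child -> parent) paths over a plain list
# of edges. The helper name and the (child, parent, column) tuple layout are
# hypothetical. It mirrors the assertions above, where a self-referencing
# relationship yields only the empty path.
def _sketch_find_forward_paths(edges, start, goal, seen=None):
    """Yield each path from start to goal as a list of (child, parent, column)."""
    seen = (seen or frozenset()) | {start}
    if start == goal:
        yield []
        return
    for child, parent, column in edges:
        # Skip edges that would revisit a dataframe already on this path.
        if child == start and parent not in seen:
            for rest in _sketch_find_forward_paths(edges, parent, goal, seen):
                yield [(child, parent, column)] + rest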


def test_find_backward_paths(es):
    paths = list(es.find_backward_paths("customers", "log"))
    assert len(paths) == 1

    path = paths[0]

    assert len(path) == 2
    assert path[0]._child_dataframe_name == "sessions"
    assert path[0]._parent_dataframe_name == "customers"
    assert path[1]._child_dataframe_name == "log"
    assert path[1]._parent_dataframe_name == "sessions"


def test_find_backward_paths_multiple_paths(diamond_es):
    paths = list(diamond_es.find_backward_paths("regions", "transactions"))
    assert len(paths) == 2

    path1, path2 = paths

    r1, r2 = path1
    assert r1._child_dataframe_name == "stores"
    assert r1._parent_dataframe_name == "regions"
    assert r2._child_dataframe_name == "transactions"
    assert r2._parent_dataframe_name == "stores"

    r1, r2 = path2
    assert r1._child_dataframe_name == "customers"
    assert r1._parent_dataframe_name == "regions"
    assert r2._child_dataframe_name == "transactions"
    assert r2._parent_dataframe_name == "customers"


def test_find_backward_paths_multiple_relationships(games_es):
    paths = list(games_es.find_backward_paths("teams", "games"))
    assert len(paths) == 2

    path1, path2 = paths
    assert len(path1) == 1
    assert len(path2) == 1
    r1 = path1[0]
    r2 = path2[0]

    assert r1._child_dataframe_name == "games"
    assert r2._child_dataframe_name == "games"
    assert r1._parent_dataframe_name == "teams"
    assert r2._parent_dataframe_name == "teams"

    assert r1._child_column_name == "home_team_id"
    assert r2._child_column_name == "away_team_id"
    assert r1._parent_column_name == "id"
    assert r2._parent_column_name == "id"


def test_has_unique_path(diamond_es):
    assert diamond_es.has_unique_forward_path("customers", "regions")
    assert not diamond_es.has_unique_forward_path("transactions", "regions")


def test_raise_key_error_missing_dataframe(es):
    error_text = "DataFrame testing does not exist in ecommerce"
    with pytest.raises(KeyError, match=error_text):
        es["testing"]

    es_without_id = EntitySet()
    error_text = "DataFrame testing does not exist in entity set"
    with pytest.raises(KeyError, match=error_text):
        es_without_id["testing"]


def test_add_parent_not_index_column(es):
    error_text = "Parent column 'language' is not the index of dataframe régions"
    with pytest.raises(AttributeError, match=error_text):
        es.add_relationship("régions", "language", "customers", "région_id")


================================================
FILE: featuretools/tests/entityset_tests/test_last_time_index.py
================================================
from datetime import datetime

import pandas as pd
import pytest
from woodwork.logical_types import Categorical, Datetime, Integer

from featuretools.entityset.entityset import LTI_COLUMN_NAME


@pytest.fixture
def values_es(es):
    es.normalize_dataframe(
        "log",
        "values",
        "value",
        make_time_index=True,
        new_dataframe_time_index="value_time",
    )
    return es


@pytest.fixture
def true_values_lti():
    true_values_lti = pd.Series(
        [
            datetime(2011, 4, 10, 10, 41, 0),
            datetime(2011, 4, 9, 10, 31, 9),
            datetime(2011, 4, 9, 10, 31, 18),
            datetime(2011, 4, 9, 10, 31, 27),
            datetime(2011, 4, 10, 10, 40, 1),
            datetime(2011, 4, 10, 10, 41, 3),
            datetime(2011, 4, 9, 10, 30, 12),
            datetime(2011, 4, 10, 10, 41, 6),
            datetime(2011, 4, 9, 10, 30, 18),
            datetime(2011, 4, 9, 10, 30, 24),
            datetime(2011, 4, 10, 11, 10, 3),
        ],
    )
    return true_values_lti


@pytest.fixture
def true_sessions_lti():
    sessions_lti = pd.Series(
        [
            datetime(2011, 4, 9, 10, 30, 24),
            datetime(2011, 4, 9, 10, 31, 27),
            datetime(2011, 4, 9, 10, 40, 0),
            datetime(2011, 4, 10, 10, 40, 1),
            datetime(2011, 4, 10, 10, 41, 6),
            datetime(2011, 4, 10, 11, 10, 3),
        ],
    )
    return sessions_lti


@pytest.fixture
def wishlist_df():
    wishlist_df = pd.DataFrame(
        {
            "session_id": [0, 1, 2, 2, 3, 4, 5],
            "datetime": [
                datetime(2011, 4, 9, 10, 30, 15),
                datetime(2011, 4, 9, 10, 31, 30),
                datetime(2011, 4, 9, 10, 30, 30),
                datetime(2011, 4, 9, 10, 35, 30),
                datetime(2011, 4, 10, 10, 41, 0),
                datetime(2011, 4, 10, 10, 39, 59),
                datetime(2011, 4, 10, 11, 10, 2),
            ],
            "product_id": [
                "coke zero",
                "taco clock",
                "coke zero",
                "car",
                "toothpaste",
                "brown bag",
                "coke zero",
            ],
        },
    )
    return wishlist_df


@pytest.fixture
def extra_session_df(es):
    row_values = {"customer_id": 2, "device_name": "PC", "device_type": 0, "id": 6}
    row = pd.DataFrame(row_values, index=pd.Index([6], name="id"))
    df = es["sessions"]
    df = pd.concat([df, row]).sort_index()
    return df


class TestLastTimeIndex:
    def test_leaf(self, es):
        es.add_last_time_indexes()
        log = es["log"]
        lti_name = log.ww.metadata.get("last_time_index")

        assert lti_name == LTI_COLUMN_NAME
        assert len(log[lti_name]) == 17

        log_df = log

        for v1, v2 in zip(log_df[lti_name], log_df["datetime"]):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_leaf_no_time_index(self, es):
        es.add_last_time_indexes()
        stores = es["stores"]
        true_lti = pd.Series([None for x in range(6)], dtype="datetime64[ns]")

        assert len(true_lti) == len(stores[LTI_COLUMN_NAME])

        stores_lti = stores[LTI_COLUMN_NAME]

        for v1, v2 in zip(stores_lti, true_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    # TODO: possible issue with either normalize_dataframe or add_last_time_indexes
    def test_parent(self, values_es, true_values_lti):
        # test dataframe with time index and all instances in child dataframe
        values_es.add_last_time_indexes()
        values = values_es["values"]
        lti_name = values.ww.metadata.get("last_time_index")
        assert len(values[lti_name]) == 10
        sorted_lti = values[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_values_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_parent_some_missing(self, values_es, true_values_lti):
        # test dataframe with time index and not all instances have children
        values = values_es["values"]

        # add extra value instance with no children
        row_values = {
            "value": [21.0],
            "value_time": [pd.Timestamp("2011-04-10 11:10:02")],
        }
        # make sure index doesn't have same name as column to suppress pandas warning
        row = pd.DataFrame(row_values, index=pd.Index([21]))
        df = pd.concat([values, row])
        df = df.sort_values(by="value")
        df.index.name = None

        values_es.replace_dataframe(dataframe_name="values", df=df)
        values_es.add_last_time_indexes()
        # lti value should default to instance's time index
        true_values_lti[10] = pd.Timestamp("2011-04-10 11:10:02")
        true_values_lti[11] = pd.Timestamp("2011-04-10 11:10:03")

        values = values_es["values"]
        lti_name = values.ww.metadata.get("last_time_index")
        assert len(values[lti_name]) == 11
        sorted_lti = values[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_values_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_parent_no_time_index(self, es, true_sessions_lti):
        # test dataframe without time index and all instances have children
        es.add_last_time_indexes()
        sessions = es["sessions"]
        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 6
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_parent_no_time_index_missing(
        self,
        es,
        extra_session_df,
        true_sessions_lti,
    ):
        # test dataframe without time index and not all instances have children

        # add session instance with no associated log instances
        es.replace_dataframe(dataframe_name="sessions", df=extra_session_df)
        es.add_last_time_indexes()
        # since sessions has no time index, default value is NaT
        true_sessions_lti[6] = pd.NaT
        sessions = es["sessions"]

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 7
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children(self, es, wishlist_df, true_sessions_lti):
        # test all instances in both children
        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=wishlist_df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()
        sessions = es["sessions"]
        # wishlist df has more recent events for two session ids
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")
        true_sessions_lti[3] = pd.Timestamp("2011-4-10 10:41:00")

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 6
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children_right_missing(self, es, wishlist_df, true_sessions_lti):
        # test all instances in left child

        # drop wishlist instance related to id 3 so it's only in log
        wishlist_df.drop(4, inplace=True)
        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=wishlist_df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()
        sessions = es["sessions"]

        # now only session id 1 has newer event in wishlist_log
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 6
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children_left_missing(
        self,
        es,
        extra_session_df,
        wishlist_df,
        true_sessions_lti,
    ):
        # add row to sessions so not all session instances are in log
        es.replace_dataframe(dataframe_name="sessions", df=extra_session_df)

        # add row to wishlist df so new session instance is in wishlist_log
        row_values = {
            "session_id": [6],
            "datetime": [pd.Timestamp("2011-04-11 11:11:11")],
            "product_id": ["toothpaste"],
        }
        row = pd.DataFrame(row_values, index=pd.RangeIndex(start=7, stop=8))
        df = pd.concat([wishlist_df, row])
        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()

        # test all instances in right child
        sessions = es["sessions"]

        # now wishlist_log has newer events for 3 session ids
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")
        true_sessions_lti[3] = pd.Timestamp("2011-4-10 10:41:00")
        true_sessions_lti[6] = pd.Timestamp("2011-04-11 11:11:11")

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 7
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children_all_combined(
        self,
        es,
        extra_session_df,
        wishlist_df,
        true_sessions_lti,
    ):
        # add row to sessions so not all session instances are in log
        es.replace_dataframe(dataframe_name="sessions", df=extra_session_df)

        # add row to wishlist_log so extra session has child instance
        row_values = {
            "session_id": [6],
            "datetime": [pd.Timestamp("2011-04-11 11:11:11")],
            "product_id": ["toothpaste"],
        }
        row = pd.DataFrame(row_values, index=pd.RangeIndex(start=7, stop=8))
        df = pd.concat([wishlist_df, row])

        # drop instance 4 so wishlist_log does not have session id 3 instance
        df.drop(4, inplace=True)
        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()

        # test some instances in right, some in left, all when combined
        sessions = es["sessions"]

        # wishlist has newer events for 2 sessions
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")
        true_sessions_lti[6] = pd.Timestamp("2011-04-11 11:11:11")

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 7
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_multiple_children_both_missing(
        self,
        es,
        extra_session_df,
        wishlist_df,
        true_sessions_lti,
    ):
        # test all instances in neither child
        sessions = es["sessions"]

        logical_types = {
            "session_id": Integer,
            "datetime": Datetime,
            "product_id": Categorical,
        }
        # add row to sessions to create session with no events
        es.replace_dataframe(dataframe_name="sessions", df=extra_session_df)

        es.add_dataframe(
            dataframe_name="wishlist_log",
            dataframe=wishlist_df,
            index="id",
            make_index=True,
            time_index="datetime",
            logical_types=logical_types,
        )
        es.add_relationship("sessions", "id", "wishlist_log", "session_id")
        es.add_last_time_indexes()
        sessions = es["sessions"]

        # wishlist has 2 newer events and one is NaT
        true_sessions_lti[1] = pd.Timestamp("2011-4-9 10:31:30")
        true_sessions_lti[3] = pd.Timestamp("2011-4-10 10:41:00")
        true_sessions_lti[6] = pd.NaT

        lti_name = sessions.ww.metadata.get("last_time_index")
        assert len(sessions[lti_name]) == 7
        sorted_lti = sessions[lti_name].sort_index()
        for v1, v2 in zip(sorted_lti, true_sessions_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2

    def test_grandparent(self, es):
        # test sorting by time works correctly across several generations
        df = es["log"]

        # For one user, change a log event to be newer than the user's normal
        # last time index. This event should be from a different session than
        # the current last time index.
        df.loc[5, "datetime"] = pd.Timestamp("2011-4-09 10:40:01")
        df = (
            df.set_index("datetime", append=True)
            .sort_index(level=[1, 0], kind="mergesort")
            .reset_index("datetime", drop=False)
        )
        es.replace_dataframe(dataframe_name="log", df=df)
        es.add_last_time_indexes()
        customers = es["customers"]

        true_customers_lti = pd.Series(
            [
                datetime(2011, 4, 9, 10, 40, 1),
                datetime(2011, 4, 10, 10, 41, 6),
                datetime(2011, 4, 10, 11, 10, 3),
            ],
        )

        lti_name = customers.ww.metadata.get("last_time_index")
        assert len(customers[lti_name]) == 3
        sorted_lti = customers.sort_values("id")[lti_name]
        for v1, v2 in zip(sorted_lti, true_customers_lti):
            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2
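

# Illustrative sketch only, not the add_last_time_indexes implementation: the
# tests above suggest that a parent's last time index is the latest time among
# its own time index (if it has one) and the times of its child rows, and that
# it stays missing when neither exists. The helper name and the dict/tuple
# inputs are hypothetical, and None stands in for NaT.
def _sketch_last_time_index(own_times, child_rows):
    """Return {parent_id: latest time} from own times and (parent_id, time) rows."""
    lti = dict(own_times)
    for parent_id, child_time in child_rows:
        if child_time is None:
            continue
        current = lti.get(parent_id)
        # A child event replaces the current value only when it is more recent.
        if current is None or child_time > current:
            lti[parent_id] = child_time
    return lti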


================================================
FILE: featuretools/tests/entityset_tests/test_plotting.py
================================================
import os
import re

import graphviz
import pandas as pd
import pytest

from featuretools import EntitySet


@pytest.fixture
def simple_es():
    es = EntitySet("test")
    df = pd.DataFrame({"foo": [1]})
    es.add_dataframe(df, dataframe_name="test", index="foo")
    return es


def test_returns_digraph_object(es):
    graph = es.plot()

    assert isinstance(graph, graphviz.Digraph)


def test_saving_png_file(es, tmp_path):
    output_path = str(tmp_path.joinpath("test1.png"))

    es.plot(to_file=output_path)

    assert os.path.isfile(output_path)
    os.remove(output_path)


def test_missing_file_extension(es):
    output_path = "test1"

    with pytest.raises(ValueError) as excinfo:
        es.plot(to_file=output_path)

    assert str(excinfo.value).startswith("Please use a file extension")


def test_invalid_format(es):
    output_path = "test1.xzy"

    with pytest.raises(ValueError) as excinfo:
        es.plot(to_file=output_path)

    assert str(excinfo.value).startswith("Unknown format")


def test_multiple_rows(es):
    plot_ = es.plot()
    result = re.findall(r"\((\d+\srows?)\)", plot_.source)
    expected = ["{} rows".format(str(i.shape[0])) for i in es.dataframes]
    assert result == expected


def test_single_row(simple_es):
    plot_ = simple_es.plot()
    result = re.findall(r"\((\d+\srows?)\)", plot_.source)
    expected = ["1 row"]
    assert result == expected


================================================
FILE: featuretools/tests/entityset_tests/test_relationship.py
================================================
from featuretools.entityset.relationship import Relationship, RelationshipPath


def test_relationship_path(es):
    log_to_sessions = Relationship(es, "sessions", "id", "log", "session_id")
    sessions_to_customers = Relationship(
        es,
        "customers",
        "id",
        "sessions",
        "customer_id",
    )
    path_list = [
        (True, log_to_sessions),
        (True, sessions_to_customers),
        (False, sessions_to_customers),
    ]
    path = RelationshipPath(path_list)

    for i, edge in enumerate(path_list):
        assert path[i] == edge

    assert [edge for edge in path] == path_list


def test_relationship_path_name(es):
    assert RelationshipPath([]).name == ""

    log_to_sessions = Relationship(es, "sessions", "id", "log", "session_id")
    sessions_to_customers = Relationship(
        es,
        "customers",
        "id",
        "sessions",
        "customer_id",
    )

    forward_path = [(True, log_to_sessions), (True, sessions_to_customers)]
    assert RelationshipPath(forward_path).name == "sessions.customers"

    backward_path = [(False, sessions_to_customers), (False, log_to_sessions)]
    assert RelationshipPath(backward_path).name == "sessions.log"

    mixed_path = [(True, log_to_sessions), (False, log_to_sessions)]
    assert RelationshipPath(mixed_path).name == "sessions.log"


def test_relationship_path_dataframes(es):
    assert list(RelationshipPath([]).dataframes()) == []

    log_to_sessions = Relationship(es, "sessions", "id", "log", "session_id")
    sessions_to_customers = Relationship(
        es,
        "customers",
        "id",
        "sessions",
        "customer_id",
    )

    forward_path = [(True, log_to_sessions), (True, sessions_to_customers)]
    assert list(RelationshipPath(forward_path).dataframes()) == [
        "log",
        "sessions",
        "customers",
    ]

    backward_path = [(False, sessions_to_customers), (False, log_to_sessions)]
    assert list(RelationshipPath(backward_path).dataframes()) == [
        "customers",
        "sessions",
        "log",
    ]

    mixed_path = [(True, log_to_sessions), (False, log_to_sessions)]
    assert list(RelationshipPath(mixed_path).dataframes()) == ["log", "sessions", "log"]
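

# Illustrative sketch only, not the RelationshipPath implementation: each edge
# is an (is_forward, relationship) pair, and walking an edge forward goes from
# child to parent while walking it backward goes from parent to child. Here a
# relationship is a hypothetical (parent_name, child_name) tuple, and the
# helper reproduces the dataframe sequences asserted above.
def _sketch_path_dataframes(path):
    """Yield the dataframe names visited when walking a list of path edges."""
    for i, (is_forward, (parent, child)) in enumerate(path):
        start, end = (child, parent) if is_forward else (parent, child)
        if i == 0:
            # Only the first edge contributes its starting dataframe.
            yield start
        yield end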


def test_names_when_multiple_relationships_between_dataframes(games_es):
    relationship = Relationship(games_es, "teams", "id", "games", "home_team_id")
    assert relationship.child_name == "games[home_team_id]"
    assert relationship.parent_name == "teams[home_team_id]"


def test_names_when_no_other_relationship_between_dataframes(home_games_es):
    relationship = Relationship(home_games_es, "teams", "id", "games", "home_team_id")
    assert relationship.child_name == "games"
    assert relationship.parent_name == "teams"


def test_relationship_serialization(es):
    relationship = Relationship(es, "sessions", "id", "log", "session_id")

    dictionary = {
        "parent_dataframe_name": "sessions",
        "parent_column_name": "id",
        "child_dataframe_name": "log",
        "child_column_name": "session_id",
    }
    assert relationship.to_dictionary() == dictionary
    assert Relationship.from_dictionary(dictionary, es) == relationship


================================================
FILE: featuretools/tests/entityset_tests/test_serialization.py
================================================
import json
import logging
import os
import tempfile
from unittest.mock import MagicMock, patch
from urllib.request import urlretrieve

import boto3
import pandas as pd
import pytest
import woodwork.type_sys.type_system as ww_type_system
from woodwork.logical_types import LogicalType, Ordinal
from woodwork.serializers.serializer_base import typing_info_to_dict
from woodwork.type_sys.utils import list_logical_types

from featuretools.entityset import EntitySet, deserialize, serialize
from featuretools.version import ENTITYSET_SCHEMA_VERSION

BUCKET_NAME = "test-bucket"
WRITE_KEY_NAME = "test-key"
TEST_S3_URL = "s3://{}/{}".format(BUCKET_NAME, WRITE_KEY_NAME)
TEST_FILE = "test_serialization_data_entityset_schema_{}_2022_09_02.tar".format(
    ENTITYSET_SCHEMA_VERSION,
)
S3_URL = "s3://featuretools-static/" + TEST_FILE
URL = "https://featuretools-static.s3.amazonaws.com/" + TEST_FILE
TEST_KEY = "test_access_key_es"


def test_entityset_description(es):
    description = serialize.entityset_to_description(es)
    _es = deserialize.description_to_entityset(description)
    assert es.metadata.__eq__(_es, deep=True)


def test_all_ww_logical_types():
    logical_types = list_logical_types()["type_string"].to_list()
    dataframe = pd.DataFrame(columns=logical_types)
    es = EntitySet()
    ltype_dict = {ltype: ltype for ltype in logical_types}
    ltype_dict["ordinal"] = Ordinal(order=[])
    es.add_dataframe(
        dataframe=dataframe,
        dataframe_name="all_types",
        index="integer",
        logical_types=ltype_dict,
    )
    description = serialize.entityset_to_description(es)
    _es = deserialize.description_to_entityset(description)
    assert es.__eq__(_es, deep=True)


def test_with_custom_ww_logical_type():
    class CustomLogicalType(LogicalType):
        pass

    ww_type_system.add_type(CustomLogicalType)
    columns = ["integer", "natural_language", "custom_logical_type"]
    dataframe = pd.DataFrame(columns=columns)
    es = EntitySet()
    ltype_dict = {
        "integer": "integer",
        "natural_language": "natural_language",
        "custom_logical_type": CustomLogicalType,
    }
    es.add_dataframe(
        dataframe=dataframe,
        dataframe_name="custom_type",
        index="integer",
        logical_types=ltype_dict,
    )
    description = serialize.entityset_to_description(es)
    _es = deserialize.description_to_entityset(description)
    assert isinstance(
        _es["custom_type"].ww.logical_types["custom_logical_type"],
        CustomLogicalType,
    )
    assert es.__eq__(_es, deep=True)


def test_serialize_invalid_formats(es, tmp_path):
    error_text = "must be one of the following formats: {}"
    error_text = error_text.format(", ".join(serialize.FORMATS))
    with pytest.raises(ValueError, match=error_text):
        serialize.write_data_description(es, path=str(tmp_path), format="")


def test_empty_dataframe(es):
    for df in es.dataframes:
        description = typing_info_to_dict(df)
        dataframe = deserialize.empty_dataframe(description)
        assert dataframe.empty
        assert all(dataframe.columns == df.columns)


def test_to_csv(es, tmp_path):
    es.to_csv(str(tmp_path), encoding="utf-8", engine="python")
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    df = es["log"]
    new_df = new_es["log"]
    assert type(df["latlong"][0]) in (tuple, list)
    assert type(new_df["latlong"][0]) in (tuple, list)


def test_to_csv_interesting_values(es, tmp_path):
    es.add_interesting_values()
    es.to_csv(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


def test_to_csv_manual_interesting_values(es, tmp_path):
    es.add_interesting_values(
        dataframe_name="log",
        values={"product_id": ["coke_zero"]},
    )
    es.to_csv(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    assert new_es["log"].ww["product_id"].ww.metadata["interesting_values"] == [
        "coke_zero",
    ]


def test_to_pickle(es, tmp_path):
    es.to_pickle(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    assert isinstance(es["log"]["latlong"][0], tuple)
    assert isinstance(new_es["log"]["latlong"][0], tuple)


def test_to_pickle_interesting_values(es, tmp_path):
    es.add_interesting_values()
    es.to_pickle(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


def test_to_pickle_manual_interesting_values(es, tmp_path):
    es.add_interesting_values(
        dataframe_name="log",
        values={"product_id": ["coke_zero"]},
    )
    es.to_pickle(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    assert new_es["log"].ww["product_id"].ww.metadata["interesting_values"] == [
        "coke_zero",
    ]


def test_to_parquet(es, tmp_path):
    es.to_parquet(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    df = es["log"]
    new_df = new_es["log"]
    assert type(df["latlong"][0]) in (tuple, list)
    assert type(new_df["latlong"][0]) in (tuple, list)


def test_to_parquet_manual_interesting_values(es, tmp_path):
    es.add_interesting_values(
        dataframe_name="log",
        values={"product_id": ["coke_zero"]},
    )
    es.to_parquet(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)
    assert new_es["log"].ww["product_id"].ww.metadata["interesting_values"] == [
        "coke_zero",
    ]


def test_to_parquet_interesting_values(es, tmp_path):
    es.add_interesting_values()
    es.to_parquet(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


def test_to_parquet_with_lti(tmp_path, mock_customer):
    es = mock_customer
    es.to_parquet(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


def test_to_pickle_id_none(tmp_path):
    es = EntitySet()
    es.to_pickle(str(tmp_path))
    new_es = deserialize.read_entityset(str(tmp_path))
    assert es.__eq__(new_es, deep=True)


# TODO: Fix Moto tests needing to explicitly set permissions for objects
@pytest.fixture
def s3_client():
    _environ = os.environ.copy()
    from moto import mock_aws

    with mock_aws():
        s3 = boto3.resource("s3")
        yield s3
    os.environ.clear()
    os.environ.update(_environ)


@pytest.fixture
def s3_bucket(s3_client, region="us-east-2"):
    location = {"LocationConstraint": region}
    s3_client.create_bucket(
        Bucket=BUCKET_NAME,
        ACL="public-read-write",
        CreateBucketConfiguration=location,
    )
    s3_bucket = s3_client.Bucket(BUCKET_NAME)
    yield s3_bucket


def make_public(s3_client, s3_bucket):
    obj = list(s3_bucket.objects.all())[0].key
    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL="public-read-write")


@pytest.mark.parametrize("profile_name", [None, False])
def test_serialize_s3_csv(es, s3_client, s3_bucket, profile_name):
    es.to_csv(TEST_S3_URL, encoding="utf-8", engine="python", profile_name=profile_name)
    make_public(s3_client, s3_bucket)
    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=profile_name)
    assert es.__eq__(new_es, deep=True)


@pytest.mark.parametrize("profile_name", [None, False])
def test_serialize_s3_pickle(es, s3_client, s3_bucket, profile_name):
    es.to_pickle(TEST_S3_URL, profile_name=profile_name)
    make_public(s3_client, s3_bucket)
    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=profile_name)
    assert es.__eq__(new_es, deep=True)


@pytest.mark.parametrize("profile_name", [None, False])
def test_serialize_s3_parquet(es, s3_client, s3_bucket, profile_name):
    es.to_parquet(TEST_S3_URL, profile_name=profile_name)
    make_public(s3_client, s3_bucket)
    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=profile_name)
    assert es.__eq__(new_es, deep=True)


def test_s3_test_profile(es, s3_client, s3_bucket, setup_test_profile):
    es.to_csv(TEST_S3_URL, encoding="utf-8", engine="python", profile_name="test")
    make_public(s3_client, s3_bucket)
    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name="test")
    assert es.__eq__(new_es, deep=True)


def test_serialize_url_csv(es):
    error_text = "Writing to URLs is not supported"
    with pytest.raises(ValueError, match=error_text):
        es.to_csv(URL, encoding="utf-8", engine="python")


def test_serialize_subdirs_not_removed(es, tmp_path):
    write_path = tmp_path.joinpath("test")
    write_path.mkdir()
    test_dir = write_path.joinpath("test_dir")
    test_dir.mkdir()
    description_path = write_path.joinpath("data_description.json")
    with open(description_path, "w") as f:
        json.dump("__SAMPLE_TEXT__", f)
    compression = None
    serialize.write_data_description(
        es,
        path=str(write_path),
        index="1",
        sep="\t",
        encoding="utf-8",
        compression=compression,
    )
    assert os.path.exists(str(test_dir))
    with open(description_path, "r") as f:
        assert "__SAMPLE_TEXT__" not in json.load(f)


def test_deserialize_local_tar(es):
    with tempfile.TemporaryDirectory() as tmp_path:
        temp_tar_filepath = os.path.join(tmp_path, TEST_FILE)
        urlretrieve(URL, filename=temp_tar_filepath)
        new_es = deserialize.read_entityset(temp_tar_filepath)
        assert es.__eq__(new_es, deep=True)


@patch("featuretools.entityset.deserialize.getfullargspec")
def test_deserialize_errors_if_python_version_unsafe(mock_inspect, es):
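    # Make the patched getfullargspec report no keyword-only arguments,
    # mimicking a Python version whose archive extraction is considered unsafe.
    mock_response = MagicMock()
    mock_response.kwonlyargs = []
    mock_inspect.return_value = mock_response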
    with tempfile.TemporaryDirectory() as tmp_path:
        temp_tar_filepath = os.path.join(tmp_path, TEST_FILE)
        urlretrieve(URL, filename=temp_tar_filepath)
        with pytest.raises(RuntimeError, match=""):
            deserialize.read_entityset(temp_tar_filepath)


def test_deserialize_url_csv(es):
    new_es = deserialize.read_entityset(URL)
    assert es.__eq__(new_es, deep=True)


def test_deserialize_s3_csv(es):
    new_es = deserialize.read_entityset(S3_URL, profile_name=False)
    assert es.__eq__(new_es, deep=True)


def test_operations_invalidate_metadata(es):
    new_es = EntitySet(id="test")
    # test metadata gets created on access
    assert new_es._data_description is None
    assert new_es.metadata is not None  # generated after access
    assert new_es._data_description is not None
    customers_ltypes = None
    new_es.add_dataframe(
        es["customers"],
        "customers",
        logical_types=customers_ltypes,
    )
    sessions_ltypes = None
    new_es.add_dataframe(
        es["sessions"],
        "sessions",
        logical_types=sessions_ltypes,
    )

    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None

    new_es = new_es.add_relationship("customers", "id", "sessions", "customer_id")
    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None

    new_es = new_es.normalize_dataframe("customers", "cohort", "cohort")
    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None

    new_es.add_last_time_indexes()
    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None

    new_es.add_interesting_values()
    assert new_es._data_description is None
    assert new_es.metadata is not None
    assert new_es._data_description is not None


def test_reset_metadata(es):
    assert es.metadata is not None
    assert es._data_description is not None
    es.reset_data_description()
    assert es._data_description is None


@patch("featuretools.utils.schema_utils.ENTITYSET_SCHEMA_VERSION", "1.1.1")
@pytest.mark.parametrize(
    "hardcoded_schema_version, warns",
    [("2.1.1", True), ("1.2.1", True), ("1.1.2", True), ("1.0.2", False)],
)
def test_later_schema_version(es, caplog, hardcoded_schema_version, warns):
    def test_version(version, warns):
        if warns:
            warning_text = (
                "The schema version of the saved entityset"
                "(%s) is greater than the latest supported (%s). "
                "You may need to upgrade featuretools. Attempting to load entityset ..."
                % (version, "1.1.1")
            )
        else:
            warning_text = None

        _check_schema_version(version, es, warning_text, caplog, "warn")

    test_version(hardcoded_schema_version, warns)


@patch("featuretools.utils.schema_utils.ENTITYSET_SCHEMA_VERSION", "1.1.1")
@pytest.mark.parametrize(
    "hardcoded_schema_version, warns",
    [("0.1.1", True), ("1.0.1", False), ("1.1.0", False)],
)
def test_earlier_schema_version(
    es,
    caplog,
    monkeypatch,
    hardcoded_schema_version,
    warns,
):
    def test_version(version, warns):
        if warns:
            warning_text = (
                "The schema version of the saved entityset"
                "(%s) is no longer supported by this version "
                "of featuretools. Attempting to load entityset ..." % version
            )
        else:
            warning_text = None

        _check_schema_version(version, es, warning_text, caplog, "log")

    test_version(hardcoded_schema_version, warns)


def _check_schema_version(version, es, warning_text, caplog, warning_type=None):
    dataframes = {
        dataframe.ww.name: typing_info_to_dict(dataframe) for dataframe in es.dataframes
    }
    relationships = [relationship.to_dictionary() for relationship in es.relationships]
    dictionary = {
        "schema_version": version,
        "id": es.id,
        "dataframes": dataframes,
        "relationships": relationships,
    }

    if warning_type == "warn" and warning_text:
        with pytest.warns(UserWarning) as record:
            deserialize.description_to_entityset(dictionary)
        assert record[0].message.args[0] == warning_text
    elif warning_type == "log":
        logger = logging.getLogger("featuretools")
        logger.propagate = True
        try:
            deserialize.description_to_entityset(dictionary)
            if warning_text:
                assert warning_text in caplog.text
            else:
                assert not caplog.text
        finally:
            # Restore the logger setting even if an assertion above fails
            logger.propagate = False


================================================
FILE: featuretools/tests/entityset_tests/test_timedelta.py
================================================
import pandas as pd
import pytest
from dateutil.relativedelta import relativedelta

from featuretools.entityset import Timedelta
from featuretools.feature_base import Feature
from featuretools.primitives import Count
from featuretools.utils.wrangle import _check_timedelta


def test_timedelta_equality():
    assert Timedelta(10, "d") == Timedelta(10, "d")
    assert Timedelta(10, "d") != 1


def test_singular():
    assert Timedelta.make_singular("Month") == "Month"
    assert Timedelta.make_singular("Months") == "Month"


def test_delta_with_observations(es):
    four_delta = Timedelta(4, "observations")
    assert not four_delta.is_absolute()
    assert four_delta.get_value("o") == 4

    neg_four_delta = -four_delta
    assert not neg_four_delta.is_absolute()
    assert neg_four_delta.get_value("o") == -4

    time = pd.to_datetime("2019-05-01")

    error_txt = "Invalid unit"
    with pytest.raises(Exception, match=error_txt):
        time + four_delta

    with pytest.raises(Exception, match=error_txt):
        time - four_delta


def test_delta_with_time_unit_matches_pandas(es):
    customer_id = 0
    sessions_df = es["sessions"]
    sessions_df = sessions_df[sessions_df["customer_id"] == customer_id]
    log_df = es["log"]
    log_df = log_df[log_df["session_id"].isin(sessions_df["id"])]
    all_times = log_df["datetime"].sort_values().tolist()

    # 4 hour delta
    value = 4
    unit = "h"
    delta = Timedelta(value, unit)
    neg_delta = -delta
    # adding the delta matches adding the equivalent pandas Timedelta
    assert all_times[0] + delta == all_times[0] + pd.Timedelta(value, unit)
    # subtracting the negated delta gives the same result
    assert all_times[0] - neg_delta == all_times[0] + pd.Timedelta(value, unit)

    # subtracting the delta matches subtracting the equivalent pandas Timedelta
    assert all_times[4] - delta == all_times[4] - pd.Timedelta(value, unit)
    # adding the negated delta gives the same result
    assert all_times[4] + neg_delta == all_times[4] - pd.Timedelta(value, unit)


def test_check_timedelta(es):
    time_units = list(Timedelta._readable_units.keys())
    expanded_units = list(Timedelta._readable_units.values())
    exp_to_standard_unit = dict(zip(expanded_units, time_units))
    singular_units = [u[:-1] for u in expanded_units]
    sing_to_standard_unit = dict(zip(singular_units, time_units))
    to_standard_unit = {}
    to_standard_unit.update(exp_to_standard_unit)
    to_standard_unit.update(sing_to_standard_unit)
    full_units = singular_units + expanded_units + time_units + time_units

    strings = ["2 {}".format(u) for u in singular_units + expanded_units + time_units]
    strings += ["2{}".format(u) for u in time_units]
    for i, s in enumerate(strings):
        unit = full_units[i]
        standard_unit = unit
        if unit in to_standard_unit:
            standard_unit = to_standard_unit[unit]

        td = _check_timedelta(s)
        assert td.get_value(standard_unit) == 2


def test_check_pd_timedelta(es):
    pdtd = pd.Timedelta(5, "m")
    td = _check_timedelta(pdtd)
    assert td.get_value("s") == 300


def test_string_timedelta_args():
    assert Timedelta("1 second") == Timedelta(1, "second")
    assert Timedelta("1 seconds") == Timedelta(1, "second")
    assert Timedelta("10 days") == Timedelta(10, "days")
    assert Timedelta("100 days") == Timedelta(100, "days")
    assert Timedelta("1001 days") == Timedelta(1001, "days")
    assert Timedelta("1001 weeks") == Timedelta(1001, "weeks")


def test_feature_takes_timedelta_string(es):
    feature = Feature(
        Feature(es["log"].ww["id"]),
        parent_dataframe_name="customers",
        use_previous="1 day",
        primitive=Count,
    )
    assert feature.use_previous == Timedelta(1, "d")


def test_deltas_week(es):
    customer_id = 0
    sessions_df = es["sessions"]
    sessions_df = sessions_df[sessions_df["customer_id"] == customer_id]
    log_df = es["log"]
    log_df = log_df[log_df["session_id"].isin(sessions_df["id"])]
    all_times = log_df["datetime"].sort_values().tolist()
    delta_week = Timedelta(1, "w")
    delta_days = Timedelta(7, "d")

    assert all_times[0] + delta_days == all_times[0] + delta_week


def test_relative_year():
    td_time = "1 years"
    td = _check_timedelta(td_time)
    assert td.get_value("Y") == 1
    assert isinstance(td.delta_obj, relativedelta)

    time = pd.to_datetime("2020-02-29")
    assert time + td == pd.to_datetime("2021-02-28")


def test_serialization():
    times = [Timedelta(1, unit="w"), Timedelta(3, unit="d"), Timedelta(5, unit="o")]

    dictionaries = [
        {"value": 1, "unit": "w"},
        {"value": 3, "unit": "d"},
        {"value": 5, "unit": "o"},
    ]

    for td, expected in zip(times, dictionaries):
        assert expected == td.get_arguments()

    for expected, dictionary in zip(times, dictionaries):
        assert expected == Timedelta.from_dictionary(dictionary)

    # Test multiple temporal parameters separately since the order of the
    # serialized units is not deterministic
    mult_time = {"years": 4, "months": 3, "days": 2}
    mult_td = Timedelta(mult_time)

    # Serialize
    td_units = mult_td.get_arguments()["unit"]
    td_values = mult_td.get_arguments()["value"]
    arg_list = list(zip(td_values, td_units))

    assert (4, "Y") in arg_list
    assert (3, "mo") in arg_list
    assert (2, "d") in arg_list

    # Deserialize
    assert mult_td == Timedelta.from_dictionary(
        {"value": [4, 3, 2], "unit": ["Y", "mo", "d"]},
    )


def test_relative_month():
    td_time = "1 month"
    td = _check_timedelta(td_time)
    assert td.get_value("mo") == 1
    assert isinstance(td.delta_obj, relativedelta)

    time = pd.to_datetime("2020-01-31")
    assert time + td == pd.to_datetime("2020-02-29")

    td_time = "6 months"
    td = _check_timedelta(td_time)
    assert td.get_value("mo") == 6
    assert isinstance(td.delta_obj, relativedelta)

    time = pd.to_datetime("2020-01-31")
    assert time + td == pd.to_datetime("2020-07-31")


def test_has_multiple_units():
    single_unit = pd.DateOffset(months=3)
    multiple_units = pd.DateOffset(months=3, years=3, days=5)
    single_td = _check_timedelta(single_unit)
    multiple_td = _check_timedelta(multiple_units)
    assert single_td.has_multiple_units() is False
    assert multiple_td.has_multiple_units() is True


def test_pd_dateoffset_to_timedelta():
    single_temporal = pd.DateOffset(months=3)
    single_td = _check_timedelta(single_temporal)
    assert single_td.get_value("mo") == 3
    assert single_td.delta_obj == pd.DateOffset(months=3)

    mult_temporal = pd.DateOffset(years=10, months=3, days=5)
    mult_td = _check_timedelta(mult_temporal)
    expected = {"Y": 10, "mo": 3, "d": 5}
    assert mult_td.get_value() == expected
    assert mult_td.delta_obj == mult_temporal
    # get_name() for multiple values is not deterministic
    assert len(mult_td.get_name()) == len("10 Years 3 Months 5 Days")

    special_dateoffset = pd.offsets.BDay(100)
    special_td = _check_timedelta(special_dateoffset)
    assert special_td.get_value("businessdays") == 100
    assert special_td.delta_obj == special_dateoffset


def test_pd_dateoffset_to_timedelta_math():
    base = pd.to_datetime("2020-01-31")
    add = _check_timedelta(pd.DateOffset(months=2))
    res = base + add
    assert res == pd.to_datetime("2020-03-31")

    base_2 = pd.to_datetime("2020-01-31")
    add_2 = _check_timedelta(pd.DateOffset(months=2, days=3))
    res_2 = base_2 + add_2
    assert res_2 == pd.to_datetime("2020-04-03")

    base_3 = pd.to_datetime("2019-09-20")
    sub = _check_timedelta(pd.offsets.BDay(10))
    res_3 = base_3 - sub
    assert res_3 == pd.to_datetime("2019-09-06")


================================================
FILE: featuretools/tests/entityset_tests/test_ww_es.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
import pytest
from woodwork.exceptions import TypeConversionError
from woodwork.logical_types import (
    Boolean,
    Categorical,
    Datetime,
    Double,
    Integer,
    NaturalLanguage,
)

from featuretools.entityset.entityset import LTI_COLUMN_NAME, EntitySet


def test_empty_es():
    es = EntitySet("es")
    assert es.id == "es"
    assert es.dataframe_dict == {}
    assert es.relationships == []
    assert es.time_type is None


@pytest.fixture
def df():
    return pd.DataFrame({"id": [0, 1, 2], "category": ["a", "b", "c"]}).astype(
        {"category": "category"},
    )


def test_init_es_with_dataframe(df):
    es = EntitySet("es", dataframes={"table": (df, "id")})
    assert es.id == "es"
    assert len(es.dataframe_dict) == 1
    assert es["table"] is df

    assert es["table"].ww.schema is not None
    assert isinstance(es["table"].ww.logical_types["id"], Integer)
    assert isinstance(es["table"].ww.logical_types["category"], Categorical)


def test_init_es_with_woodwork_table_same_name(df):
    df.ww.init(index="id", name="table")
    es = EntitySet("es", dataframes={"table": (df,)})

    assert es.id == "es"
    assert len(es.dataframe_dict) == 1
    assert es["table"] is df

    assert es["table"].ww.schema is not None

    assert es["table"].ww.index == "id"
    assert es["table"].ww.time_index is None

    assert isinstance(es["table"].ww.logical_types["id"], Integer)
    assert isinstance(es["table"].ww.logical_types["category"], Categorical)


def test_init_es_with_woodwork_table_diff_name_error(df):
    df.ww.init(index="id", name="table")
    error = "Naming conflict in dataframes dictionary: dictionary key 'diff_name' does not match dataframe name 'table'"
    with pytest.raises(ValueError, match=error):
        EntitySet("es", dataframes={"diff_name": (df,)})


def test_init_es_with_dataframe_and_params(df):
    logical_types = {"id": "NaturalLanguage", "category": NaturalLanguage}
    semantic_tags = {"category": "new_tag"}
    es = EntitySet(
        "es",
        dataframes={"table": (df, "id", None, logical_types, semantic_tags)},
    )

    assert es.id == "es"
    assert len(es.dataframe_dict) == 1
    assert es["table"] is df

    assert es["table"].ww.schema is not None

    assert es["table"].ww.index == "id"
    assert es["table"].ww.time_index is None

    assert isinstance(es["table"].ww.logical_types["id"], NaturalLanguage)
    assert isinstance(es["table"].ww.logical_types["category"], NaturalLanguage)

    assert es["table"].ww.semantic_tags["id"] == {"index"}
    assert es["table"].ww.semantic_tags["category"] == {"new_tag"}


def test_init_es_with_multiple_dataframes(df):
    second_df = pd.DataFrame({"id": [0, 1, 2, 3], "first_table_id": [1, 2, 2, 1]})

    df.ww.init(name="first_table", index="id")

    es = EntitySet(
        "es",
        dataframes={
            "first_table": (df,),
            "second_table": (
                second_df,
                "id",
                None,
                None,
                {"first_table_id": "foreign_key"},
            ),
        },
    )

    assert len(es.dataframe_dict) == 2
    assert es["first_table"].ww.schema is not None
    assert es["second_table"].ww.schema is not None


def test_add_dataframe_to_es(df):
    es1 = EntitySet("es")
    assert es1.dataframe_dict == {}
    es1.add_dataframe(
        df,
        dataframe_name="table",
        index="id",
        semantic_tags={"category": "new_tag"},
    )
    assert len(es1.dataframe_dict) == 1

    copy_df = df.ww.copy()

    es2 = EntitySet("es")
    assert es2.dataframe_dict == {}
    es2.add_dataframe(copy_df)
    assert len(es2.dataframe_dict) == 1

    assert es1["table"].ww == es2["table"].ww


def test_change_es_dataframe_schema(df):
    df.ww.init(index="id", name="table")
    es = EntitySet("es", dataframes={"table": (df,)})

    assert es["table"].ww.index == "id"

    es["table"].ww.set_index("category")
    assert es["table"].ww.index == "category"


def test_init_es_with_relationships(df):
    second_df = pd.DataFrame({"id": [0, 1, 2, 3], "first_table_id": [1, 2, 2, 1]})

    df.ww.init(name="first_table", index="id")
    second_df.ww.init(name="second_table", index="id")

    es = EntitySet(
        "es",
        dataframes={"first_table": (df,), "second_table": (second_df,)},
        relationships=[("first_table", "id", "second_table", "first_table_id")],
    )

    assert len(es.relationships) == 1

    forward_dataframes = [name for name, _ in es.get_forward_dataframes("second_table")]
    assert forward_dataframes[0] == "first_table"

    relationship = es.relationships[0]
    assert "foreign_key" in relationship.child_column.ww.semantic_tags
    assert "index" in relationship.parent_column.ww.semantic_tags


@pytest.fixture
def dates_df():
    return pd.DataFrame(
        {
            "backwards_order": [8, 7, 6, 5, 4, 3, 2, 1, 0],
            "dates_backwards": [
                "2020-09-09",
                "2020-09-08",
                "2020-09-07",
                "2020-09-06",
                "2020-09-05",
                "2020-09-04",
                "2020-09-03",
                "2020-09-02",
                "2020-09-01",
            ],
            "random_order": [7, 6, 8, 0, 2, 4, 3, 1, 5],
            "repeating_dates": [
                "2020-08-01",
                "2019-08-01",
                "2020-08-01",
                "2012-08-01",
                "2019-08-01",
                "2019-08-01",
                "2019-08-01",
                "2013-08-01",
                "2019-08-01",
            ],
            "special": [7, 8, 0, 1, 4, 2, 6, 3, 5],
            "special_dates": [
                "2020-08-01",
                "2019-08-01",
                "2020-08-01",
                "2012-08-01",
                "2019-08-01",
                "2019-08-01",
                "2019-08-01",
                "2013-08-01",
                "2019-08-01",
            ],
        },
    )


def test_add_secondary_time_index(dates_df):
    dates_df.ww.init(
        name="dates_table",
        index="backwards_order",
        time_index="dates_backwards",
    )
    es = EntitySet("es")
    es.add_dataframe(
        dates_df,
        secondary_time_index={"repeating_dates": ["random_order", "special"]},
    )

    assert dates_df.ww.metadata["secondary_time_index"] == {
        "repeating_dates": ["random_order", "special", "repeating_dates"],
    }


def test_time_type_check_order(dates_df):
    dates_df.ww.init(
        name="dates_table",
        index="backwards_order",
        time_index="random_order",
    )
    es = EntitySet("es")

    error = "dates_table time index is Datetime type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error):
        es.add_dataframe(
            dates_df,
            secondary_time_index={"repeating_dates": ["random_order", "special"]},
        )

    assert "secondary_time_index" not in dates_df.ww.metadata


def test_add_time_index_through_woodwork_different_type(dates_df):
    dates_df.ww.init(
        name="dates_table",
        index="backwards_order",
        time_index="dates_backwards",
    )
    es = EntitySet("es")

    es.add_dataframe(
        dates_df,
        secondary_time_index={"repeating_dates": ["random_order", "special"]},
    )

    assert dates_df.ww.metadata["secondary_time_index"] == {
        "repeating_dates": ["random_order", "special", "repeating_dates"],
    }
    assert es.time_type == Datetime

    assert es._check_uniform_time_index(es["dates_table"]) is None

    dates_df.ww.set_time_index("random_order")
    assert dates_df.ww.time_index == "random_order"

    error = "dates_table time index is numeric type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error):
        es._check_uniform_time_index(es["dates_table"])


def test_init_with_mismatched_time_types(dates_df):
    dates_df.ww.init(
        name="dates_table",
        index="backwards_order",
        time_index="repeating_dates",
    )
    es = EntitySet("es")
    es.add_dataframe(dates_df, secondary_time_index={"special_dates": ["special"]})
    assert es.time_type == Datetime

    nums_df = pd.DataFrame({"id": [1, 2, 3], "times": [9, 8, 7]})
    nums_df.ww.init(name="numerics_table", index="id", time_index="times")

    error = "numerics_table time index is numeric type which differs from other entityset time indexes"
    with pytest.raises(TypeError, match=error):
        es.add_dataframe(nums_df)


def test_int_double_time_type(dates_df):
    dates_df.ww.init(
        name="dates_table",
        index="backwards_order",
        time_index="random_order",
        logical_types={"random_order": "Integer", "special": "Double"},
    )
    es = EntitySet("es")

    # Both random_order and special are numeric, but they are different logical types
    es.add_dataframe(dates_df, secondary_time_index={"special": ["dates_backwards"]})

    assert isinstance(es["dates_table"].ww.logical_types["random_order"], Integer)
    assert isinstance(es["dates_table"].ww.logical_types["special"], Double)

    assert es["dates_table"].ww.time_index == "random_order"
    assert "special" in es["dates_table"].ww.metadata["secondary_time_index"]


def test_normalize_dataframe():
    df = pd.DataFrame(
        {
            "id": range(4),
            "full_name": [
                "Mr. John Doe",
                "Doe, Mrs. Jane",
                "James Brown",
                "Ms. Paige Turner",
            ],
            "email": [
                "john.smith@example.com",
                np.nan,
                "team@featuretools.com",
                "junk@example.com",
            ],
            "phone_number": [
                "5555555555",
                "555-555-5555",
                "1-(555)-555-5555",
                "555-555-5555",
            ],
            "age": pd.Series([33, None, 33, 57], dtype="Int64"),
            "signup_date": [pd.to_datetime("2020-09-01")] * 4,
            "is_registered": pd.Series([True, False, True, None], dtype="boolean"),
        },
    )

    df.ww.init(name="first_table", index="id", time_index="signup_date")
    es = EntitySet("es")
    es.add_dataframe(df)
    es.normalize_dataframe(
        "first_table",
        "second_table",
        "age",
        additional_columns=["phone_number", "full_name"],
        make_time_index=True,
    )
    assert len(es.dataframe_dict) == 2
    assert "foreign_key" in es["first_table"].ww.semantic_tags["age"]


def test_replace_dataframe():
    df = pd.DataFrame(
        {
            "id": range(4),
            "full_name": [
                "Mr. John Doe",
                "Doe, Mrs. Jane",
                "James Brown",
                "Ms. Paige Turner",
            ],
            "email": [
                "john.smith@example.com",
                np.nan,
                "team@featuretools.com",
                "junk@example.com",
            ],
            "phone_number": [
                "5555555555",
                "555-555-5555",
                "1-(555)-555-5555",
                "555-555-5555",
            ],
            "age": pd.Series([33, None, 33, 57], dtype="Int64"),
            "signup_date": [pd.to_datetime("2020-09-01")] * 4,
            "is_registered": pd.Series([True, False, True, None], dtype="boolean"),
        },
    )

    df.ww.init(name="table", index="id")
    es = EntitySet("es")
    es.add_dataframe(df)
    original_schema = es["table"].ww.schema

    new_df = df.iloc[2:]
    es.replace_dataframe("table", new_df)

    assert len(es["table"]) == 2
    assert es["table"].ww.schema == original_schema


def test_add_last_time_index(es):
    es.add_last_time_indexes(["products"])

    assert "last_time_index" in es["products"].ww.metadata

    assert es["products"].ww.metadata["last_time_index"] == LTI_COLUMN_NAME
    assert LTI_COLUMN_NAME in es["products"]
    assert "last_time_index" in es["products"].ww.semantic_tags[LTI_COLUMN_NAME]
    assert isinstance(es["products"].ww.logical_types[LTI_COLUMN_NAME], Datetime)


def test_lti_already_has_last_time_column_name(es):
    col = es["customers"].ww.pop("loves_ice_cream")
    col.name = LTI_COLUMN_NAME

    es["customers"].ww[LTI_COLUMN_NAME] = col

    assert LTI_COLUMN_NAME in es["customers"].columns
    assert isinstance(es["customers"].ww.logical_types[LTI_COLUMN_NAME], Boolean)

    error = (
        "Cannot add a last time index on DataFrame with an existing "
        f"'{LTI_COLUMN_NAME}' column. Please rename '{LTI_COLUMN_NAME}'."
    )
    with pytest.raises(ValueError, match=error):
        es.add_last_time_indexes(["customers"])


def test_numeric_es_last_time_index_logical_type(int_es):
    assert int_es.time_type == "numeric"

    int_es.add_last_time_indexes()

    for df in int_es.dataframes:
        assert isinstance(df.ww.logical_types[LTI_COLUMN_NAME], Double)
        int_es._check_uniform_time_index(df, LTI_COLUMN_NAME)


def test_datetime_es_last_time_index_logical_type(es):
    assert es.time_type == Datetime

    es.add_last_time_indexes()

    for df in es.dataframes:
        assert isinstance(df.ww.logical_types[LTI_COLUMN_NAME], Datetime)
        es._check_uniform_time_index(df, LTI_COLUMN_NAME)


def test_dataframe_without_name(es):
    new_es = EntitySet()

    new_df = es["sessions"].copy()

    assert new_df.ww.schema is None

    error = "Cannot add dataframe to EntitySet without a name. Please provide a value for the dataframe_name parameter."
    with pytest.raises(ValueError, match=error):
        new_es.add_dataframe(new_df)


def test_dataframe_with_name_parameter(es):
    new_es = EntitySet()

    new_df = es["sessions"][["id"]]

    assert new_df.ww.schema is None

    new_es.add_dataframe(
        new_df,
        dataframe_name="df_name",
        index="id",
        logical_types={"id": "Integer"},
    )
    assert new_es["df_name"].ww.name == "df_name"


def test_woodwork_dataframe_without_name_errors(es):
    new_es = EntitySet()

    new_df = es["sessions"].ww.copy()
    new_df.ww._schema.name = None

    assert new_df.ww.name is None

    error = "Cannot add a Woodwork DataFrame to EntitySet without a name"
    with pytest.raises(ValueError, match=error):
        new_es.add_dataframe(new_df)


def test_woodwork_dataframe_with_name(es):
    new_es = EntitySet()

    new_df = es["sessions"].ww.copy()
    new_df.ww._schema.name = "df_name"

    assert new_df.ww.name == "df_name"

    new_es.add_dataframe(new_df)

    assert new_es["df_name"].ww.name == "df_name"


def test_woodwork_dataframe_ignore_conflicting_name_parameter_warning(es):
    new_es = EntitySet()

    new_df = es["sessions"].ww.copy()
    new_df.ww._schema.name = "df_name"

    assert new_df.ww.name == "df_name"

    warning = "A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: dataframe_name"
    with pytest.warns(UserWarning, match=warning):
        new_es.add_dataframe(new_df, dataframe_name="conflicting_name")

    assert new_es["df_name"].ww.name == "df_name"


def test_woodwork_dataframe_same_name_parameter(es):
    new_es = EntitySet()

    new_df = es["sessions"].ww.copy()
    new_df.ww._schema.name = "df_name"

    assert new_df.ww.name == "df_name"

    new_es.add_dataframe(new_df, dataframe_name="df_name")

    assert new_es["df_name"].ww.name == "df_name"


def test_extra_woodwork_params(es):
    new_es = EntitySet()

    sessions_df = es["sessions"].ww.copy()

    assert sessions_df.ww.index == "id"
    assert sessions_df.ww.time_index is None
    assert isinstance(sessions_df.ww.logical_types["id"], Integer)

    warning_msg = (
        "A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: "
        "index, time_index, logical_types, make_index, semantic_tags, already_sorted"
    )
    with pytest.warns(UserWarning, match=warning_msg):
        new_es.add_dataframe(
            dataframe_name="sessions",
            dataframe=sessions_df,
            index="filepath",
            time_index="customer_id",
            logical_types={"id": Categorical},
            make_index=True,
            already_sorted=True,
            semantic_tags={"id": "new_tag"},
        )
    assert sessions_df.ww.index == "id"
    assert sessions_df.ww.time_index is None
    assert isinstance(sessions_df.ww.logical_types["id"], Integer)
    assert "new_tag" not in sessions_df.ww.semantic_tags


def test_replace_dataframe_errors(es):
    df = es["customers"].copy()
    df["new"] = pd.Series([1, 2, 3])

    error_text = "New dataframe is missing new cohort column"
    with pytest.raises(ValueError, match=error_text):
        es.replace_dataframe(dataframe_name="customers", df=df.drop(columns=["cohort"]))

    error_text = "New dataframe contains 16 columns, expecting 15"
    with pytest.raises(ValueError, match=error_text):
        es.replace_dataframe(dataframe_name="customers", df=df)


def test_replace_dataframe_already_sorted(es):
    # test already_sorted on dataframe without time index
    df = es["sessions"].copy()
    updated_id = df["id"]
    updated_id.iloc[1] = 2
    updated_id.iloc[2] = 1

    df = df.set_index("id", drop=False)
    df.index.name = None
    es.replace_dataframe(dataframe_name="sessions", df=df.copy(), already_sorted=False)
    sessions_df = es["sessions"]
    assert sessions_df["id"].iloc[1] == 2  # no sorting since time index not defined
    es.replace_dataframe(dataframe_name="sessions", df=df.copy(), already_sorted=True)
    sessions_df = es["sessions"]
    assert sessions_df["id"].iloc[1] == 2

    # test already_sorted on dataframe with time index
    df = es["customers"].copy()
    updated_signup = df["signup_date"]
    updated_signup.iloc[0] = datetime(2011, 4, 11)

    assert es["customers"].ww.time_index == "signup_date"

    df["signup_date"] = updated_signup

    es.replace_dataframe(dataframe_name="customers", df=df.copy(), already_sorted=True)
    customers_df = es["customers"]
    assert customers_df["id"].iloc[0] == 2

    es.replace_dataframe(dataframe_name="customers", df=df.copy(), already_sorted=False)
    updated_customers = es["customers"]
    assert updated_customers["id"].iloc[0] == 0


def test_replace_dataframe_invalid_schema(es):
    df = es["customers"].copy()
    df["id"] = pd.Series([1, 1, 1])

    error_text = "Index column must be unique"
    with pytest.raises(IndexError, match=error_text):
        es.replace_dataframe(dataframe_name="customers", df=df)


def test_replace_dataframe_mismatched_index(es):
    df = es["customers"].copy()
    df["id"] = pd.Series([99, 88, 77])

    es.replace_dataframe(dataframe_name="customers", df=df)

    assert all([77, 99, 88] == es["customers"]["id"])
    assert all([77, 99, 88] == (es["customers"]["id"]).index)


def test_replace_dataframe_different_dtypes(es):
    # A dtype that can be cast to the existing logical type is converted back on replacement
    float_dtype_df = es["customers"].copy()
    float_dtype_df = float_dtype_df.astype({"age": "float64"})

    es.replace_dataframe(dataframe_name="customers", df=float_dtype_df)

    assert es["customers"]["age"].dtype == "int64"
    assert isinstance(es["customers"].ww.logical_types["age"], Integer)

    # Data that cannot be cast to the existing logical type raises a TypeConversionError
    incompatible_dtype_df = es["customers"].copy()
    incompatible_list = ["hi", "bye", "bye"]
    incompatible_dtype_df["age"] = pd.Series(incompatible_list)

    error_msg = "Error converting datatype for age from type object to type int64. Please confirm the underlying data is consistent with logical type Integer."
    with pytest.raises(TypeConversionError, match=error_msg):
        es.replace_dataframe(dataframe_name="customers", df=incompatible_dtype_df)


@pytest.fixture()
def latlong_df():
    latlong_df = pd.DataFrame(
        {
            "tuples": pd.Series([(1, 2), (3, 4)]),
            "string_tuple": pd.Series(["(1, 2)", "(3, 4)"]),
            "bracketless_string_tuple": pd.Series(["1, 2", "3, 4"]),
            "list_strings": pd.Series([["1", "2"], ["3", "4"]]),
            "combo_tuple_types": pd.Series(["[1, 2]", "(3, 4)"]),
        },
    )
    latlong_df.set_index("string_tuple", drop=False, inplace=True)
    latlong_df.index.name = None
    return latlong_df


def test_replace_dataframe_data_transformation(latlong_df):
    initial_df = latlong_df.copy()
    initial_df.ww.init(
        name="latlongs",
        index="string_tuple",
        logical_types={col_name: "LatLong" for col_name in initial_df.columns},
    )
    es = EntitySet()
    es.add_dataframe(dataframe=initial_df)

    df = es["latlongs"]
    expected_val = (1, 2)
    for col in latlong_df.columns:
        series = df[col]
        assert series.iloc[0] == expected_val

    es.replace_dataframe("latlongs", latlong_df)
    df = es["latlongs"]
    expected_val = (3, 4)
    for col in latlong_df.columns:
        series = df[col]
        assert series.iloc[-1] == expected_val


def test_replace_dataframe_column_order(es):
    original_column_order = es["customers"].columns.copy()

    df = es["customers"].copy()
    col = df.pop("cohort")
    df[col.name] = col

    assert not df.columns.equals(original_column_order)
    assert set(df.columns) == set(original_column_order)

    es.replace_dataframe(dataframe_name="customers", df=df)

    assert es["customers"].columns.equals(original_column_order)


def test_replace_dataframe_different_woodwork_initialized(es):
    df = es["customers"].copy()
    df["age"] = pd.Series([1, 2, 3])

    # Initialize Woodwork on the new DataFrame and change the schema so it won't match the original DataFrame's schema
    df.ww.init(schema=es["customers"].ww.schema)
    df.ww.set_types(
        logical_types={"id": "NaturalLanguage", "cancel_date": "NaturalLanguage"},
    )
    assert df["id"].dtype == "string"
    assert df["cancel_date"].dtype == "string"

    assert es["customers"]["id"].dtype == "int64"
    assert es["customers"]["cancel_date"].dtype == "datetime64[ns]"

    original_schema = es["customers"].ww.schema

    warning = "Woodwork typing information on new dataframe will be replaced with existing typing information from customers"
    with pytest.warns(UserWarning, match=warning):
        es.replace_dataframe("customers", df, already_sorted=True)

    actual = es["customers"]["age"].sort_values()
    assert all(actual == [1, 2, 3])

    assert es["customers"].ww._schema == original_schema
    assert es["customers"]["id"].dtype == "int64"
    assert es["customers"]["cancel_date"].dtype == "datetime64[ns]"


def test_replace_dataframe_and_min_last_time_index(es):
    es.add_last_time_indexes(["products"])

    original_time_index = es["log"]["datetime"].copy()
    original_last_time_index = es["products"][LTI_COLUMN_NAME].copy()

    new_time_index = original_time_index + pd.Timedelta(days=1)
    expected_last_time_index = original_last_time_index + pd.Timedelta(days=1)

    new_dataframe = es["log"].copy()
    new_dataframe["datetime"] = new_time_index
    new_dataframe.pop(LTI_COLUMN_NAME)

    es.replace_dataframe("log", new_dataframe, recalculate_last_time_indexes=True)

    pd.testing.assert_series_equal(
        es["products"][LTI_COLUMN_NAME].sort_index(),
        expected_last_time_index.sort_index(),
    )
    pd.testing.assert_series_equal(
        es["log"][LTI_COLUMN_NAME].sort_index(),
        new_time_index.sort_index(),
        check_names=False,
    )


def test_replace_dataframe_dont_recalculate_last_time_index_present(es):
    es.add_last_time_indexes()

    original_time_index = es["customers"]["signup_date"].copy()
    original_last_time_index = es["customers"][LTI_COLUMN_NAME].copy()

    new_time_index = original_time_index + pd.Timedelta(days=10)

    new_dataframe = es["customers"].copy()
    new_dataframe["signup_date"] = new_time_index

    es.replace_dataframe(
        "customers",
        new_dataframe,
        recalculate_last_time_indexes=False,
    )
    pd.testing.assert_series_equal(
        es["customers"][LTI_COLUMN_NAME],
        original_last_time_index,
    )


def test_replace_dataframe_dont_recalculate_last_time_index_not_present(es):
    es.add_last_time_indexes()
    original_lti_name = es["customers"].ww.metadata.get("last_time_index")
    assert original_lti_name is not None

    original_time_index = es["customers"]["signup_date"].copy()

    new_time_index = original_time_index + pd.Timedelta(days=10)

    new_dataframe = es["customers"].copy()
    new_dataframe["signup_date"] = new_time_index
    new_dataframe.pop(LTI_COLUMN_NAME)

    es.replace_dataframe(
        "customers",
        new_dataframe,
        recalculate_last_time_indexes=False,
    )
    assert "last_time_index" not in es["customers"].ww.metadata
    assert original_lti_name not in es["customers"].columns


def test_replace_dataframe_recalculate_last_time_index_not_present(es):
    es.add_last_time_indexes()

    original_time_index = es["log"]["datetime"].copy()

    new_time_index = original_time_index + pd.Timedelta(days=10)

    new_dataframe = es["log"].copy()
    new_dataframe["datetime"] = new_time_index
    new_dataframe.pop(LTI_COLUMN_NAME)

    es.replace_dataframe("log", new_dataframe, recalculate_last_time_indexes=True)
    pd.testing.assert_series_equal(
        es["log"]["datetime"].sort_index(),
        new_time_index.sort_index(),
        check_names=False,
    )
    pd.testing.assert_series_equal(
        es["log"][LTI_COLUMN_NAME].sort_index(),
        new_time_index.sort_index(),
        check_names=False,
    )


def test_replace_dataframe_recalculate_last_time_index_present(es):
    es.add_last_time_indexes()

    original_time_index = es["log"]["datetime"].copy()

    new_time_index = original_time_index + pd.Timedelta(days=10)

    new_dataframe = es["log"].copy()
    new_dataframe["datetime"] = new_time_index
    assert LTI_COLUMN_NAME in new_dataframe.columns

    es.replace_dataframe("log", new_dataframe, recalculate_last_time_indexes=True)
    pd.testing.assert_series_equal(
        es["log"]["datetime"].sort_index(),
        new_time_index.sort_index(),
        check_names=False,
    )
    pd.testing.assert_series_equal(
        es["log"][LTI_COLUMN_NAME].sort_index(),
        new_time_index.sort_index(),
        check_names=False,
    )


def test_normalize_dataframe_keeps_column_metadata(es):
    es["log"].ww.columns["value"].metadata["interesting_values"] = [0.0, 1.0]
    es["log"].ww.columns["priority_level"].metadata["interesting_values"] = [1]

    es["log"].ww.columns["value"].description = "a value column"
    es["log"].ww.columns["priority_level"].description = "a priority level column"

    assert "interesting_values" in es["log"].ww.columns["priority_level"].metadata
    assert "interesting_values" in es["log"].ww.columns["value"].metadata
    assert es["log"].ww.columns["value"].description == "a value column"
    assert (
        es["log"].ww.columns["priority_level"].description == "a priority level column"
    )

    es.normalize_dataframe(
        "log",
        "values_2",
        "value_2",
        additional_columns=["priority_level"],
        copy_columns=["value"],
        make_time_index=False,
    )

    # Column metadata and descriptions are preserved in both the original and the new dataframe
    assert "interesting_values" in es["log"].ww.columns["value"].metadata
    assert "interesting_values" in es["values_2"].ww.columns["value"].metadata
    assert "interesting_values" in es["values_2"].ww.columns["priority_level"].metadata
    assert es["log"].ww.columns["value"].description == "a value column"
    assert es["values_2"].ww.columns["value"].description == "a value column"
    assert (
        es["values_2"].ww.columns["priority_level"].description
        == "a priority level column"
    )


def test_normalize_ww_init():
    es = EntitySet()
    df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4],
            "col": ["a", "b", "c", "d"],
            "df2_id": [1, 1, 2, 2],
            "df2_col": [True, False, True, True],
        },
    )

    df.ww.init(index="id", name="test_name")
    es.add_dataframe(dataframe=df)

    assert es["test_name"].ww.name == "test_name"
    assert es["test_name"].ww.schema.name == "test_name"

    es.normalize_dataframe(
        "test_name",
        "new_df",
        "df2_id",
        additional_columns=["df2_col"],
    )

    assert es["test_name"].ww.name == "test_name"
    assert es["test_name"].ww.schema.name == "test_name"

    assert es["new_df"].ww.name == "new_df"
    assert es["new_df"].ww.schema.name == "new_df"


================================================
FILE: featuretools/tests/entry_point_tests/__init__.py
================================================


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/__init__.py
================================================


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_plugin/__init__.py
================================================


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_plugin/featuretools_plugin/__init__.py
================================================
raise NotImplementedError("plugin not implemented")


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_plugin/setup.py
================================================
from setuptools import setup

setup(
    name="featuretools_plugin",
    packages=["featuretools_plugin"],
    entry_points={
        "featuretools_plugin": [
            "module = featuretools_plugin",
        ],
    },
)


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/__init__.py
================================================


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/__init__.py
================================================


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/existing_primitive.py
================================================
from featuretools.primitives.base import AggregationPrimitive


class Sum(AggregationPrimitive):
    """A primitive whose name already exists in featuretools, used to test name collisions."""

    pass


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/invalid_primitive.py
================================================
raise NotImplementedError("invalid primitive")


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/new_primitive.py
================================================
from featuretools.primitives.base import TransformPrimitive


class NewPrimitive(TransformPrimitive):
    """A primitive whose name does not yet exist in featuretools, used to test entry-point loading."""

    pass


================================================
FILE: featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/setup.py
================================================
from setuptools import find_packages, setup

setup(
    name="featuretools_primitives",
    packages=find_packages(),
    entry_points={
        "featuretools_primitives": [
            "new = featuretools_primitives.new_primitive",
            "invalid = featuretools_primitives.invalid_primitive",
            "existing = featuretools_primitives.existing_primitive",
        ],
    },
)


================================================
FILE: featuretools/tests/entry_point_tests/test_plugin.py
================================================
from featuretools.tests.entry_point_tests.utils import (
    _import_featuretools,
    _install_featuretools_plugin,
    _uninstall_featuretools_plugin,
)


def test_plugin_warning():
    # Import featuretools at two log levels while the broken plugin is installed
    _install_featuretools_plugin()
    warning = _import_featuretools("warning").stdout.decode()
    debug = _import_featuretools("debug").stdout.decode()
    _uninstall_featuretools_plugin()

    message = (
        "Featuretools failed to load plugin module from library featuretools_plugin"
    )
    traceback = "NotImplementedError: plugin not implemented"

    # Both levels report the failure, but only the debug log includes the traceback
    assert message in warning
    assert traceback not in warning
    assert message in debug
    assert traceback in debug


================================================
FILE: featuretools/tests/entry_point_tests/test_primitives.py
================================================
from featuretools.tests.entry_point_tests.utils import (
    _import_featuretools,
    _install_featuretools_primitives,
    _python,
    _uninstall_featuretools_primitives,
)


def test_entry_point():
    # Capture the debug log and try importing NewPrimitive while the add-on is installed
    _install_featuretools_primitives()
    featuretools_log = _import_featuretools("debug").stdout.decode()
    new_primitive = _python("-c", "from featuretools.primitives import NewPrimitive")
    _uninstall_featuretools_primitives()
    assert new_primitive.returncode == 0

    invalid_primitive = 'Featuretools failed to load "invalid" primitives from "featuretools_primitives.invalid_primitive". '
    invalid_primitive += "For a full stack trace, set logging to debug."
    assert invalid_primitive in featuretools_log

    existing_primitive = 'While loading primitives via "existing" entry point, '
    existing_primitive += 'ignored primitive "Sum" from "featuretools_primitives.existing_primitive" because a primitive '
    existing_primitive += 'with that name already exists in "featuretools.primitives.standard.aggregation.sum_primitive"'
    assert existing_primitive in featuretools_log


================================================
FILE: featuretools/tests/entry_point_tests/utils.py
================================================
import os
import subprocess
import sys


def _get_path_to_add_ons(*args):
    """Return the path to the ``add-ons`` directory next to this file, joined with ``args``."""
    pwd = os.path.dirname(__file__)
    return os.path.join(pwd, "add-ons", *args)


def _python(*args):
    """Run the current Python interpreter with ``args`` and capture its stdout."""
    command = [sys.executable, *args]
    return subprocess.run(command, stdout=subprocess.PIPE)


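def _install_featuretools_plugin():
    # The editable install runs from inside the add-on directory.
    # Note that os.chdir changes the working directory of the current process.
    os.chdir(_get_path_to_add_ons("featuretools_plugin"))
    return _python("-m", "pip", "install", "-e", ".")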


def _uninstall_featuretools_plugin():
    return _python("-m", "pip", "uninstall", "featuretools_plugin", "-y")


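def _install_featuretools_primitives():
    # The editable install runs from inside the add-on directory.
    # Note that os.chdir changes the working directory of the current process.
    os.chdir(_get_path_to_add_ons("featuretools_primitives"))
    return _python("-m", "pip", "install", "-e", ".")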


def _uninstall_featuretools_primitives():
    return _python("-m", "pip", "uninstall", "featuretools_primitives", "-y")


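def _import_featuretools(level=None):
    """Import featuretools in a subprocess and return the completed process.

    If ``level`` is given, ``FEATURETOOLS_LOG_LEVEL`` is set to it before the import.
    """
    c = ""
    if level:
        c += "import os;"
        c += 'os.environ["FEATURETOOLS_LOG_LEVEL"] = "%s";' % level

    c += "import featuretools;"
    return _python("-c", c)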


================================================
FILE: featuretools/tests/feature_discovery/__init__.py
================================================


================================================
FILE: featuretools/tests/feature_discovery/test_convertors.py
================================================
from woodwork.logical_types import Double, NaturalLanguage

from featuretools.entityset.entityset import EntitySet
from featuretools.feature_base.feature_base import (
    FeatureBase,
    IdentityFeature,
    TransformFeature,
)
from featuretools.feature_discovery.convertors import (
    _convert_feature_to_featurebase,
    convert_feature_list_to_featurebase_list,
    convert_featurebase_list_to_feature_list,
)
from featuretools.feature_discovery.feature_discovery import (
    generate_features_from_primitives,
    schema_to_features,
)
from featuretools.feature_discovery.LiteFeature import (
    LiteFeature,
)
from featuretools.primitives import Absolute, AddNumeric, Lag
from featuretools.synthesis import dfs
from featuretools.tests.feature_discovery.test_feature_discovery import (
    MultiOutputPrimitiveForTest,
)
from featuretools.tests.testing_utils.generate_fake_dataframe import (
    generate_fake_dataframe,
)


def test_convert_featurebase_list_to_feature_list():
    col_defs = [
        ("idx", "Integer", {"index"}),
        ("f_1", "Double"),
        ("f_2", "Double"),
        ("f_3", "NaturalLanguage"),
    ]

    df = generate_fake_dataframe(
        col_defs=col_defs,
    )

    es = EntitySet(id="es")
    es.add_dataframe(df, df.ww.name)

    fdefs = dfs(
        entityset=es,
        target_dataframe_name=df.ww.name,
        trans_primitives=[AddNumeric, MultiOutputPrimitiveForTest],
        features_only=True,
        max_depth=1,
    )
    assert isinstance(fdefs, list)
    assert isinstance(fdefs[0], FeatureBase)

    converted_features = set(convert_featurebase_list_to_feature_list(fdefs))

    f1 = LiteFeature("f_1", Double)
    f2 = LiteFeature("f_2", Double)
    f3 = LiteFeature("f_3", NaturalLanguage)
    fadd = LiteFeature(
        name="f_1 + f_2",
        tags={"numeric"},
        primitive=AddNumeric(),
        base_features=[f1, f2],
    )
    fmo0 = LiteFeature(
        name="TEST_MO(f_3)[0]",
        tags={"numeric"},
        primitive=MultiOutputPrimitiveForTest(),
        base_features=[f3],
        idx=0,
    )
    fmo1 = LiteFeature(
        name="TEST_MO(f_3)[1]",
        tags={"numeric"},
        primitive=MultiOutputPrimitiveForTest(),
        base_features=[f3],
        idx=1,
    )
    fmo0.related_features = {fmo1}
    fmo1.related_features = {fmo0}

    orig_features = set([f1, f2, fadd, fmo0, fmo1])

    assert len(orig_features.symmetric_difference(converted_features)) == 0


def test_origin_feature_to_featurebase():
    df = generate_fake_dataframe(
        col_defs=[("idx", "Double", {"index"}), ("f_1", "Double")],
    )
    es = EntitySet(id="test")
    es.add_dataframe(df, df.ww.name)

    origin_features = schema_to_features(df.ww.schema)
    f_1 = [f for f in origin_features if f.name == "f_1"][0]
    fb = _convert_feature_to_featurebase(f_1, df, {})

    assert isinstance(fb, IdentityFeature)
    assert fb.get_name() == "f_1"

    f_1.set_alias("new name")
    df.ww.rename({"f_1": "new name"}, inplace=True)
    fb = _convert_feature_to_featurebase(f_1, df, {})

    assert isinstance(fb, IdentityFeature)
    assert fb.get_name() == "new name"


def test_stacked_feature_to_featurebase():
    df = generate_fake_dataframe(
        col_defs=[("idx", "Double", {"index"}), ("f_1", "Double")],
    )
    es = EntitySet(id="test")
    es.add_dataframe(df, df.ww.name)

    origin_features = schema_to_features(df.ww.schema)
    f_1 = [f for f in origin_features if f.name == "f_1"][0]
    features = generate_features_from_primitives([f_1], [Absolute()])

    f_2 = [f for f in features if f.name == "ABSOLUTE(f_1)"][0]

    fb = _convert_feature_to_featurebase(f_2, df, {})

    assert isinstance(fb, TransformFeature)
    assert fb.get_name() == "ABSOLUTE(f_1)"
    assert len(fb.base_features) == 1
    assert fb.base_features[0].get_name() == "f_1"

    f_2.set_alias("f_2")
    fb = _convert_feature_to_featurebase(f_2, df, {})

    assert isinstance(fb, TransformFeature)
    assert fb.get_name() == "f_2"
    assert len(fb.base_features) == 1
    assert fb.base_features[0].get_name() == "f_1"


def test_multi_output_to_featurebase():
    df = generate_fake_dataframe(
        col_defs=[
            ("idx", "Double", {"index"}),
            ("f_1", "NaturalLanguage"),
        ],
    )
    es = EntitySet(id="test")
    es.add_dataframe(df, df.ww.name)

    origin_features = schema_to_features(df.ww.schema)
    f_1 = [f for f in origin_features if f.name == "f_1"][0]
    features = generate_features_from_primitives([f_1], [MultiOutputPrimitiveForTest()])

    lsa_features = [f for f in features if f.get_primitive_name() == "test_mo"]
    assert len(lsa_features) == 2

    # Test Single LiteFeature
    fb = _convert_feature_to_featurebase(lsa_features[0], df, {})
    assert isinstance(fb, TransformFeature)
    assert fb.get_name() == "TEST_MO(f_1)"
    assert len(fb.base_features) == 1
    assert set(fb.get_feature_names()) == set(["TEST_MO(f_1)[0]", "TEST_MO(f_1)[1]"])
    assert fb.base_features[0].get_name() == "f_1"

    # Test that feature gets consolidated
    fb_list = convert_feature_list_to_featurebase_list(lsa_features, df)
    assert len(fb_list) == 1
    assert fb_list[0].get_name() == "TEST_MO(f_1)"
    assert len(fb_list[0].base_features) == 1
    assert set(fb_list[0].get_feature_names()) == set(
        ["TEST_MO(f_1)[0]", "TEST_MO(f_1)[1]"],
    )
    assert fb_list[0].base_features[0].get_name() == "f_1"

    lsa_features[0].set_alias("f_2")
    lsa_features[1].set_alias("f_3")

    fb = _convert_feature_to_featurebase(lsa_features[0], df, {})
    assert isinstance(fb, TransformFeature)
    assert len(fb.base_features) == 1
    assert set(fb.get_feature_names()) == set(["f_2", "f_3"])
    assert fb.base_features[0].get_name() == "f_1"

    # Test that feature gets consolidated
    fb_list = convert_feature_list_to_featurebase_list(lsa_features, df)
    assert len(fb_list) == 1
    assert len(fb_list[0].base_features) == 1
    assert set(fb_list[0].get_feature_names()) == set(["f_2", "f_3"])
    assert fb_list[0].base_features[0].get_name() == "f_1"


def test_stacking_on_multioutput_to_featurebase():
    col_defs = [
        ("idx", "Double", {"index"}),
        ("t_idx", "Datetime", {"time_index"}),
        ("f_1", "NaturalLanguage"),
    ]
    df = generate_fake_dataframe(
        col_defs=col_defs,
    )
    es = EntitySet(id="test")
    es.add_dataframe(df, df.ww.name)

    origin_features = schema_to_features(df.ww.schema)
    time_index_feature = [f for f in origin_features if f.name == "t_idx"][0]
    f_1 = [f for f in origin_features if f.name == "f_1"][0]

    features = generate_features_from_primitives([f_1], [MultiOutputPrimitiveForTest()])
    lsa_features = [f for f in features if f.get_primitive_name() == "test_mo"]
    assert len(lsa_features) == 2

    features = generate_features_from_primitives(
        lsa_features + [time_index_feature],
        [Lag(periods=2)],
    )
    lag_features = [f for f in features if f.get_primitive_name() == "lag"]
    assert len(lag_features) == 2

    fb_list = convert_feature_list_to_featurebase_list(lag_features, df)

    assert len(fb_list) == 2
    assert isinstance(fb_list[0], TransformFeature)
    assert set([x.get_name() for x in fb_list]) == set(
        [
            "LAG(TEST_MO(f_1)[0], t_idx, periods=2)",
            "LAG(TEST_MO(f_1)[1], t_idx, periods=2)",
        ],
    )

    lsa_features[0].set_alias("f_2")
    lsa_features[1].set_alias("f_3")
    features = generate_features_from_primitives(
        lsa_features + [time_index_feature],
        [Lag(periods=2)],
    )
    lag_features = [f for f in features if f.get_primitive_name() == "lag"]
    assert len(lag_features) == 2

    fb_list = convert_feature_list_to_featurebase_list(lag_features, df)
    assert len(fb_list) == 2
    assert isinstance(fb_list[0], TransformFeature)
    assert set([x.get_name() for x in fb_list]) == set(
        ["LAG(f_2, t_idx, periods=2)", "LAG(f_3, t_idx, periods=2)"],
    )


================================================
FILE: featuretools/tests/feature_discovery/test_feature_collection.py
================================================
import pytest
from woodwork.logical_types import (
    Boolean,
    Double,
    Ordinal,
)

from featuretools.feature_discovery.FeatureCollection import FeatureCollection
from featuretools.feature_discovery.LiteFeature import LiteFeature
from featuretools.primitives import Absolute, AddNumeric


@pytest.mark.parametrize(
    "feature_args, expected",
    [
        (
            ("idx", Double),
            ["ANY", "Double", "Double,numeric", "numeric"],
        ),
        (
            ("idx", Double, {"index"}),
            ["ANY", "Double", "Double,index", "index"],
        ),
        (
            ("idx", Double, {"other"}),
            [
                "ANY",
                "Double",
                "other",
                "numeric",
                "Double,other",
                "Double,numeric",
                "numeric,other",
                "Double,numeric,other",
            ],
        ),
        (
            ("idx", Ordinal, {"other"}),
            [
                "ANY",
                "Ordinal",
                "other",
                "category",
                "Ordinal,other",
                "Ordinal,category",
                "category,other",
                "Ordinal,category,other",
            ],
        ),
        (
            ("idx", Double, {"a", "b", "numeric"}),
            [
                "ANY",
                "Double",
                "a",
                "b",
                "numeric",
                "Double,a",
                "Double,b",
                "Double,numeric",
                "a,b",
                "a,numeric",
                "b,numeric",
                "a,b,numeric",
                "Double,a,b",
                "Double,a,numeric",
                "Double,b,numeric",
                "Double,a,b,numeric",
            ],
        ),
    ],
)
def test_to_keys_method(feature_args, expected):
    feature = LiteFeature(*feature_args)

    keys = FeatureCollection.feature_to_keys(feature)

    assert set(keys) == set(expected)


def test_feature_collection_hashing():
    f1 = LiteFeature(name="f1", logical_type=Double)
    f2 = LiteFeature(name="f2", logical_type=Double, tags={"index"})
    f3 = LiteFeature(name="f3", logical_type=Boolean, tags={"other"})
    f4 = LiteFeature(name="f4", primitive=Absolute(), base_features=[f1])
    f5 = LiteFeature(name="f5", primitive=AddNumeric(), base_features=[f1, f2])

    fc1 = FeatureCollection([f1, f2, f3, f4, f5])
    fc2 = FeatureCollection([f1, f2, f3, f4, f5])

    assert len(set([fc1, fc2])) == 1

    fc1.reindex()
    assert fc1.get_by_logical_type(Double) == set([f1, f2])

    assert fc1.get_by_tag("index") == set([f2])

    assert fc1.get_by_origin_feature(f1) == set([f1, f4, f5])

    assert fc1.get_dependencies_by_origin_name("f1") == set([f1, f4, f5])

    assert fc1.get_dependencies_by_origin_name("null") == set()

    assert fc1.get_by_origin_feature_name("f1") == f1

    assert fc1.get_by_origin_feature_name("null") is None


================================================
FILE: featuretools/tests/feature_discovery/test_feature_discovery.py
================================================
from unittest.mock import patch

import pytest
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import (
    Boolean,
    BooleanNullable,
    Datetime,
    Double,
    NaturalLanguage,
    Ordinal,
)

from featuretools.entityset.entityset import EntitySet
from featuretools.feature_discovery.feature_discovery import (
    _get_features,
    _get_matching_features,
    _index_column_set,
    generate_features_from_primitives,
    schema_to_features,
)
from featuretools.feature_discovery.FeatureCollection import FeatureCollection
from featuretools.feature_discovery.LiteFeature import (
    LiteFeature,
)
from featuretools.feature_discovery.utils import column_schema_to_keys
from featuretools.primitives import (
    Absolute,
    AddNumeric,
    Count,
    DateFirstEvent,
    Equal,
    Lag,
    MultiplyNumericBoolean,
    NumUnique,
    TransformPrimitive,
)
from featuretools.primitives.utils import get_transform_primitives
from featuretools.synthesis import dfs
from featuretools.tests.testing_utils.generate_fake_dataframe import (
    generate_fake_dataframe,
)

DEFAULT_LT_FOR_TAG = {
    "category": Ordinal,
    "numeric": Double,
    "time_index": Datetime,
}


class MultiOutputPrimitiveForTest(TransformPrimitive):
    name = "test_mo"
    input_types = [ColumnSchema(logical_type=NaturalLanguage)]
    return_type = ColumnSchema(semantic_tags={"numeric"})
    number_output_features = 2


class DoublePrimitiveForTest(TransformPrimitive):
    name = "test_double"
    input_types = [ColumnSchema(logical_type=Double)]
    return_type = ColumnSchema(logical_type=Double)


@pytest.mark.parametrize(
    "column_schema, expected",
    [
        (ColumnSchema(logical_type=Double), "Double"),
        (ColumnSchema(semantic_tags={"index"}), "index"),
        (
            ColumnSchema(logical_type=Double, semantic_tags={"index", "other"}),
            "Double,index,other",
        ),
    ],
)
def test_column_schema_to_keys(column_schema, expected):
    actual = column_schema_to_keys(column_schema)
    # Compare the key strings directly; wrapping them in set() would only
    # compare their character sets.
    assert actual == expected


@pytest.mark.parametrize(
    "column_list, expected",
    [
        ([ColumnSchema(logical_type=Boolean)], [("Boolean", 1)]),
        ([ColumnSchema()], [("ANY", 1)]),
        (
            [
                ColumnSchema(logical_type=Boolean),
                ColumnSchema(logical_type=Boolean),
            ],
            [("Boolean", 2)],
        ),
    ],
)
def test_index_column_set(column_list, expected):
    actual = _index_column_set(column_list)

    assert actual == expected


@pytest.mark.parametrize(
    "feature_args, input_set, commutative, expected",
    [
        (
            [("f1", Boolean), ("f2", Boolean), ("f3", Boolean)],
            [ColumnSchema(logical_type=Boolean)],
            False,
            [["f1"], ["f2"], ["f3"]],
        ),
        (
            [("f1", Boolean), ("f2", Boolean)],
            [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],
            False,
            [["f1", "f2"], ["f2", "f1"]],
        ),
        (
            [("f1", Boolean), ("f2", Boolean)],
            [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],
            True,
            [["f1", "f2"]],
        ),
        (
            [("f1", Datetime, {"time_index"})],
            [ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"})],
            False,
            [["f1"]],
        ),
        (
            [("f1", Double, {"other", "index"})],
            [ColumnSchema(logical_type=Double, semantic_tags={"index", "other"})],
            False,
            [["f1"]],
        ),
        (
            [
                ("f1", Double),
                ("f2", Boolean),
                ("f3", Double),
                ("f4", Boolean),
                ("f5", Double),
            ],
            [
                ColumnSchema(logical_type=Double),
                ColumnSchema(logical_type=Double),
                ColumnSchema(logical_type=Boolean),
            ],
            True,
            [
                ["f1", "f3", "f2"],
                ["f1", "f3", "f4"],
                ["f1", "f5", "f2"],
                ["f1", "f5", "f4"],
                ["f3", "f5", "f2"],
                ["f3", "f5", "f4"],
            ],
        ),
    ],
)
@patch.object(LiteFeature, "_generate_hash", lambda x: x.name)
def test_get_features(feature_args, input_set, commutative, expected):
    features = [LiteFeature(*args) for args in feature_args]
    feature_collection = FeatureCollection(features).reindex()

    column_keys = _index_column_set(input_set)
    actual = _get_features(feature_collection, tuple(column_keys), commutative)

    assert set([tuple([y.id for y in x]) for x in actual]) == set(
        [tuple(x) for x in expected],
    )


@pytest.mark.parametrize(
    "feature_args, primitive, expected",
    [
        (
            [("f1", Double), ("f2", Double), ("f3", Double)],
            AddNumeric,
            [["f1", "f2"], ["f1", "f3"], ["f2", "f3"]],
        ),
        (
            [("f1", Boolean), ("f2", Boolean), ("f3", Boolean)],
            AddNumeric,
            [],
        ),
        (
            [("f7", Double), ("f8", Boolean)],
            MultiplyNumericBoolean,
            [["f7", "f8"]],
        ),
        (
            [("f9", Datetime)],
            DateFirstEvent,
            [],
        ),
        (
            [("f10", Datetime, {"time_index"})],
            DateFirstEvent,
            [["f10"]],
        ),
        (
            [("f11", Datetime, {"time_index"}), ("f12", Double)],
            NumUnique,
            [],
        ),
        (
            [("f13", Datetime, {"time_index"}), ("f14", Double), ("f15", Ordinal)],
            NumUnique,
            [["f15"]],
        ),
        (
            [("f16", Datetime, {"time_index"}), ("f17", Double), ("f18", Ordinal)],
            Equal,
            [["f16", "f17"], ["f16", "f18"], ["f17", "f18"]],
        ),
        (
            [
                ("t_idx", Datetime, {"time_index"}),
                ("f19", Ordinal),
                ("f20", Double),
                ("f21", Boolean),
                ("f22", BooleanNullable),
            ],
            Lag,
            [["f19", "t_idx"], ["f20", "t_idx"], ["f21", "t_idx"], ["f22", "t_idx"]],
        ),
        (
            [
                ("idx", Double, {"index"}),
                ("f23", Double),
            ],
            Count,
            [["idx"]],
        ),
        (
            [
                ("idx", Double, {"index"}),
                ("f23", Double),
            ],
            AddNumeric,
            [],
        ),
    ],
)
@patch.object(LiteFeature, "__lt__", lambda x, y: x.name < y.name)
def test_get_matching_features(feature_args, primitive, expected):
    features = [LiteFeature(*args) for args in feature_args]
    feature_collection = FeatureCollection(features).reindex()
    actual = _get_matching_features(feature_collection, primitive())
    assert [[y.name for y in x] for x in actual] == expected


@pytest.mark.parametrize(
    "col_defs, primitives, expected",
    [
        (
            [
                ("f_1", "Double"),
                ("f_2", "Double"),
                ("f_3", "Boolean"),
                ("f_4", "Double"),
            ],
            [AddNumeric],
            {"f_1 + f_2", "f_1 + f_4", "f_2 + f_4"},
        ),
        (
            [
                ("f_1", "Double"),
                ("f_2", "Double"),
            ],
            [Absolute],
            {"ABSOLUTE(f_1)", "ABSOLUTE(f_2)"},
        ),
    ],
)
@patch.object(LiteFeature, "__lt__", lambda x, y: x.name < y.name)
def test_generate_features_from_primitives(col_defs, primitives, expected):
    input_feature_names = set([x[0] for x in col_defs])
    df = generate_fake_dataframe(
        col_defs=col_defs,
    )

    origin_features = schema_to_features(df.ww.schema)
    features = generate_features_from_primitives(origin_features, primitives)

    new_feature_names = set([x.name for x in features]) - input_feature_names
    assert new_feature_names == expected


ALL_TRANSFORM_PRIMITIVES = list(get_transform_primitives().values())


@pytest.mark.parametrize(
    "col_defs, primitives",
    [
        (
            [
                ("idx", "Double", {"index"}),
                ("t_idx", "Datetime", {"time_index"}),
                ("f_3", "Boolean"),
                ("f_4", "Boolean"),
                ("f_5", "BooleanNullable"),
                ("f_6", "BooleanNullable"),
                ("f_7", "Categorical"),
                ("f_8", "Categorical"),
                ("f_9", "Datetime"),
                ("f_10", "Datetime"),
                ("f_11", "Double"),
                ("f_12", "Double"),
                ("f_13", "Integer"),
                ("f_14", "Integer"),
                ("f_15", "IntegerNullable"),
                ("f_16", "IntegerNullable"),
                ("f_17", "EmailAddress"),
                ("f_18", "EmailAddress"),
                ("f_19", "LatLong"),
                ("f_20", "LatLong"),
                ("f_21", "NaturalLanguage"),
                ("f_22", "NaturalLanguage"),
                ("f_23", "Ordinal"),
                ("f_24", "Ordinal"),
                ("f_25", "URL"),
                ("f_26", "URL"),
                ("f_27", "PostalCode"),
                ("f_28", "PostalCode"),
            ],
            ALL_TRANSFORM_PRIMITIVES,
        ),
    ],
)
@patch.object(LiteFeature, "_generate_hash", lambda x: x.name)
def test_compare_dfs(col_defs, primitives):
    input_feature_names = set([x[0] for x in col_defs])
    df = generate_fake_dataframe(
        col_defs=col_defs,
    )

    es = EntitySet(id="test")
    es.add_dataframe(df, "df")

    features_old = dfs(
        entityset=es,
        target_dataframe_name="df",
        trans_primitives=primitives,
        features_only=True,
        return_types="all",
        max_depth=1,
    )

    origin_features = schema_to_features(df.ww.schema)
    features = generate_features_from_primitives(origin_features, primitives)

    feature_names_old = set([x.get_name() for x in features_old]) - input_feature_names  # type: ignore

    feature_names_new = set([x.name for x in features]) - input_feature_names
    assert feature_names_old == feature_names_new


def test_generate_features_from_primitives_inputs():
    f1 = LiteFeature("f1", Double)
    with pytest.raises(
        ValueError,
        match="input_features must be an iterable of LiteFeature objects",
    ):
        generate_features_from_primitives(f1, [Absolute])

    with pytest.raises(
        ValueError,
        match="input_features must be an iterable of LiteFeature objects",
    ):
        generate_features_from_primitives([f1, "other"], [Absolute])

    with pytest.raises(
        ValueError,
        match="primitives must be a list of Primitive classes or Primitive instances",
    ):
        generate_features_from_primitives([f1], ["absolute"])

    with pytest.raises(
        ValueError,
        match="primitives must be a list of Primitive classes or Primitive instances",
    ):
        generate_features_from_primitives([f1], Absolute)


================================================
FILE: featuretools/tests/feature_discovery/test_type_defs.py
================================================
import json
from unittest.mock import patch

import pytest
from woodwork.logical_types import Boolean, Double

from featuretools.feature_discovery.feature_discovery import (
    generate_features_from_primitives,
    schema_to_features,
)
from featuretools.feature_discovery.FeatureCollection import FeatureCollection
from featuretools.feature_discovery.LiteFeature import LiteFeature
from featuretools.primitives import (
    Absolute,
    AddNumeric,
    DivideNumeric,
    Lag,
    MultiplyNumeric,
)
from featuretools.tests.feature_discovery.test_feature_discovery import (
    MultiOutputPrimitiveForTest,
)
from featuretools.tests.testing_utils.generate_fake_dataframe import (
    generate_fake_dataframe,
)


def test_feature_type_equality():
    f1 = LiteFeature("f1", Double)
    f2 = LiteFeature("f2", Double)

    # AddNumeric is commutative, so all three features should be equal
    # regardless of name or base feature order
    f3 = LiteFeature(
        name="Column 1",
        primitive=AddNumeric(),
        logical_type=Double,
        base_features=[f1, f2],
    )

    f4 = LiteFeature(
        name="Column 10",
        primitive=AddNumeric(),
        logical_type=Double,
        base_features=[f1, f2],
    )

    f5 = LiteFeature(
        name="Column 20",
        primitive=AddNumeric(),
        logical_type=Double,
        base_features=[f2, f1],
    )

    assert f3 == f4 == f5

    # DivideNumeric is not commutative, so swapping the base features
    # should produce a different feature
    f6 = LiteFeature(
        name="Column 1",
        primitive=DivideNumeric(),
        logical_type=Double,
        base_features=[f1, f2],
    )

    f7 = LiteFeature(
        name="Column 1",
        primitive=DivideNumeric(),
        logical_type=Double,
        base_features=[f2, f1],
    )

    assert f6 != f7


def test_feature_type_assertions():
    with pytest.raises(
        ValueError,
        match="there must be base features if given a primitive",
    ):
        LiteFeature(
            name="Column 1",
            primitive=AddNumeric(),
            logical_type=Double,
        )


@patch.object(LiteFeature, "_generate_hash", lambda x: x.name)
@patch(
    "featuretools.feature_discovery.LiteFeature.hash_primitive",
    lambda x: (x.name, None),
)
def test_feature_to_dict():
    f1 = LiteFeature("f1", Double)
    f2 = LiteFeature("f2", Double)
    f = LiteFeature(
        name="Column 1",
        primitive=AddNumeric(),
        base_features=[f1, f2],
    )

    expected = {
        "name": "Column 1",
        "logical_type": None,
        "tags": ["numeric"],
        "primitive": "add_numeric",
        "base_features": ["f1", "f2"],
        "df_id": None,
        "id": "Column 1",
        "related_features": [],
        "idx": 0,
    }

    actual = f.to_dict()
    json_str = json.dumps(actual)
    assert actual == expected
    assert json.dumps(expected) == json_str


def test_feature_hash():
    bf1 = LiteFeature("bf", Double)
    bf2 = LiteFeature("bf", Double, df_id="df")

    p1 = Lag(periods=1)
    p2 = Lag(periods=2)
    f1 = LiteFeature(
        primitive=p1,
        logical_type=Double,
        base_features=[bf1],
    )

    f2 = LiteFeature(
        primitive=p2,
        logical_type=Double,
        base_features=[bf1],
    )

    f3 = LiteFeature(
        primitive=p2,
        logical_type=Double,
        base_features=[bf1],
    )

    f4 = LiteFeature(
        primitive=p1,
        logical_type=Double,
        base_features=[bf2],
    )

    # TODO(dreed): ensure ID is parquet and arrow acceptable, length and starting character might be problematic

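    assert f1 != f2  # same base feature, different primitive arguments (periods)
    assert f2 == f3  # same primitive arguments and same base feature
    assert f1 != f4  # same primitive, but base features differ by df_id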


def test_feature_forced_name():
    bf = LiteFeature("bf", Double)

    p1 = Lag(periods=1)
    f1 = LiteFeature(
        name="target_delay_1",
        primitive=p1,
        logical_type=Double,
        base_features=[bf],
    )
    assert f1.name == "target_delay_1"


@patch.object(LiteFeature, "_generate_hash", lambda x: x.name)
@patch(
    "featuretools.feature_discovery.FeatureCollection.hash_primitive",
    lambda x: (x.name, None),
)
@patch(
    "featuretools.feature_discovery.LiteFeature.hash_primitive",
    lambda x: (x.name, None),
)
def test_feature_collection_to_dict():
    f1 = LiteFeature("f1", Double)
    f2 = LiteFeature("f2", Double)
    f3 = LiteFeature(
        name="Column 1",
        primitive=AddNumeric(),
        base_features=[f1, f2],
    )

    fc = FeatureCollection([f3])

    expected = {
        "primitives": {
            "add_numeric": None,
        },
        "feature_ids": ["Column 1"],
        "all_features": {
            "Column 1": {
                "name": "Column 1",
                "logical_type": None,
                "tags": ["numeric"],
                "primitive": "add_numeric",
                "base_features": ["f1", "f2"],
                "df_id": None,
                "id": "Column 1",
                "related_features": [],
                "idx": 0,
            },
            "f1": {
                "name": "f1",
                "logical_type": "Double",
                "tags": ["numeric"],
                "primitive": None,
                "base_features": [],
                "df_id": None,
                "id": "f1",
                "related_features": [],
                "idx": 0,
            },
            "f2": {
                "name": "f2",
                "logical_type": "Double",
                "tags": ["numeric"],
                "primitive": None,
                "base_features": [],
                "df_id": None,
                "id": "f2",
                "related_features": [],
                "idx": 0,
            },
        },
    }

    actual = fc.to_dict()
    assert actual == expected
    assert json.dumps(expected, sort_keys=True) == json.dumps(actual, sort_keys=True)


@patch.object(LiteFeature, "_generate_hash", lambda x: x.name)
def test_feature_collection_from_dict():
    f1 = LiteFeature("f1", Double)
    f2 = LiteFeature("f2", Double)
    f3 = LiteFeature(
        name="Column 1",
        primitive=AddNumeric(),
        base_features=[f1, f2],
    )

    expected = FeatureCollection([f3])

    input_dict = {
        "primitives": {
            "009da67f0a1430630c4a419c84aac270ec62337ab20c080e4495272950fd03b3": {
                "type": "AddNumeric",
                "module": "featuretools.primitives.standard.transform.binary.add_numeric",
                "arguments": {},
            },
        },
        "feature_ids": ["Column 1"],
        "all_features": {
            "f2": {
                "name": "f2",
                "logical_type": "Double",
                "tags": ["numeric"],
                "primitive": None,
                "base_features": [],
                "df_id": None,
                "id": "f2",
                "related_features": [],
                "idx": 0,
            },
            "f1": {
                "name": "f1",
                "logical_type": "Double",
                "tags": ["numeric"],
                "primitive": None,
                "base_features": [],
                "df_id": None,
                "id": "f1",
                "related_features": [],
                "idx": 0,
            },
            "Column 1": {
                "name": "Column 1",
                "logical_type": None,
                "tags": ["numeric"],
                "primitive": "009da67f0a1430630c4a419c84aac270ec62337ab20c080e4495272950fd03b3",
                "base_features": ["f1", "f2"],
                "df_id": None,
                "id": "Column 1",
                "related_features": [],
                "idx": 0,
            },
        },
    }

    actual = FeatureCollection.from_dict(input_dict)

    assert actual == expected


@patch.object(LiteFeature, "__lt__", lambda x, y: x.name < y.name)
def test_feature_collection_serialization_roundtrip():
    col_defs = [
        ("idx", "Integer", {"index"}),
        ("t_idx", "Datetime", {"time_index"}),
        ("f_1", "Double"),
        ("f_2", "Double"),
        ("f_3", "Categorical"),
        ("f_4", "Boolean"),
        ("f_5", "NaturalLanguage"),
    ]

    df = generate_fake_dataframe(
        col_defs=col_defs,
    )

    origin_features = schema_to_features(df.ww.schema)
    features = generate_features_from_primitives(
        origin_features,
        [Absolute, MultiplyNumeric, MultiOutputPrimitiveForTest],
    )

    features = generate_features_from_primitives(features, [Lag])

    assert set([x.name for x in features]) == set(
        [
            "idx",
            "t_idx",
            "f_1",
            "f_2",
            "f_3",
            "f_4",
            "f_5",
            "ABSOLUTE(f_1)",
            "ABSOLUTE(f_2)",
            "f_1 * f_2",
            "TEST_MO(f_5)[0]",
            "TEST_MO(f_5)[1]",
            "LAG(f_1, t_idx)",
            "LAG(f_2, t_idx)",
            "LAG(f_3, t_idx)",
            "LAG(f_4, t_idx)",
            "LAG(ABSOLUTE(f_1), t_idx)",
            "LAG(ABSOLUTE(f_2), t_idx)",
            "LAG(f_1 * f_2, t_idx)",
            "LAG(TEST_MO(f_5)[1], t_idx)",
            "LAG(TEST_MO(f_5)[0], t_idx)",
        ],
    )
    fc = FeatureCollection(features=features)
    fc_dict = fc.to_dict()

    fc_json = json.dumps(fc_dict)

    fc2_dict = json.loads(fc_json)

    fc2 = FeatureCollection.from_dict(fc2_dict)

    assert fc == fc2
    # Each output of the two-output test primitive should reference its sibling
    multi_output_features = [
        x for x in fc2.all_features if x.get_primitive_name() == "test_mo"
    ]
    assert len(multi_output_features[0].related_features) == 1


def test_lite_feature_assertions():
    f1 = LiteFeature(name="f1", logical_type=Double)
    f2 = LiteFeature(name="f1", logical_type=Double, df_id="df1")

    assert f1 != f2

    with pytest.raises(
        TypeError,
        match="Name must be given if origin feature",
    ):
        LiteFeature(logical_type=Double)

    with pytest.raises(
        TypeError,
        match="Logical Type must be given if origin feature",
    ):
        LiteFeature(name="f1")

    with pytest.raises(
        ValueError,
        match="primitive input must be of type PrimitiveBase",
    ):
        LiteFeature(name="f3", primitive="AddNumeric", base_features=[f1, f2])

    f = LiteFeature("f4", logical_type=Double)
    with pytest.raises(AttributeError, match="name is immutable"):
        f.name = "new name"

    with pytest.raises(ValueError, match="only used on multioutput features"):
        f.non_indexed_name

    with pytest.raises(AttributeError, match="logical_type is immutable"):
        f.logical_type = Boolean

    with pytest.raises(AttributeError, match="tags is immutable"):
        f.tags = {"other"}

    with pytest.raises(AttributeError, match="primitive is immutable"):
        f.primitive = AddNumeric

    with pytest.raises(AttributeError, match="base_features are immutable"):
        f.base_features = [f1]

    with pytest.raises(AttributeError, match="df_id is immutable"):
        f.df_id = "df_id"

    with pytest.raises(AttributeError, match="id is immutable"):
        f.id = "id"

    with pytest.raises(AttributeError, match="n_output_features is immutable"):
        f.n_output_features = "n_output_features"

    with pytest.raises(AttributeError, match="depth is immutable"):
        f.depth = "depth"

    with pytest.raises(AttributeError, match="idx is immutable"):
        f.idx = "idx"


def test_lite_feature_to_column_schema():
    f1 = LiteFeature(name="f1", logical_type=Double, tags={"index", "numeric"})

    column_schema = f1.column_schema

    assert column_schema.is_numeric
    assert isinstance(column_schema.logical_type, Double)
    assert column_schema.semantic_tags == {"index", "numeric"}

    f2 = LiteFeature(name="f2", primitive=Absolute(), base_features=[f1])

    column_schema = f2.column_schema
    assert column_schema.semantic_tags == {"numeric"}


def test_lite_feature_to_dependent_primitives():
    f1 = LiteFeature(name="f1", logical_type=Double)

    f2 = LiteFeature(name="f2", primitive=Absolute(), base_features=[f1])

    f3 = LiteFeature(name="f3", primitive=AddNumeric(), base_features=[f1, f2])

    f4 = LiteFeature(name="f4", primitive=MultiplyNumeric(), base_features=[f1, f3])

    assert set([x.name for x in f4.dependent_primitives()]) == set(
        ["multiply_numeric", "absolute", "add_numeric"],
    )


================================================
FILE: featuretools/tests/primitive_tests/__init__.py
================================================


================================================
FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/__init__.py
================================================


================================================
FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_agg_primitives.py
================================================
from datetime import datetime
from math import sqrt

import numpy as np
import pandas as pd
import pytest
from pandas.core.dtypes.dtypes import CategoricalDtype
from pytest import raises

from featuretools.primitives import (
    AverageCountPerUnique,
    DateFirstEvent,
    Entropy,
    FirstLastTimeDelta,
    HasNoDuplicates,
    IsMonotonicallyDecreasing,
    IsMonotonicallyIncreasing,
    Kurtosis,
    MaxCount,
    MaxMinDelta,
    MedianCount,
    MinCount,
    NMostCommon,
    NMostCommonFrequency,
    NumFalseSinceLastTrue,
    NumPeaks,
    NumTrueSinceLastFalse,
    NumZeroCrossings,
    NUniqueDays,
    NUniqueDaysOfCalendarYear,
    NUniqueDaysOfMonth,
    NUniqueMonths,
    NUniqueWeeks,
    PercentTrue,
    Trend,
    Variance,
    get_aggregation_primitives,
)
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    check_serialize,
    find_applicable_primitives,
    valid_dfs,
)


def test_nmostcommon_categorical():
    n_most = NMostCommon(3)
    expected = pd.Series([1.0, 2.0, np.nan])

    ints = pd.Series([1, 2, 1, 1]).astype("int64")
    assert pd.Series(n_most(ints)).equals(expected)

    cats = pd.Series([1, 2, 1, 1]).astype("category")
    assert pd.Series(n_most(cats)).equals(expected)

    # value_counts() includes entries for categories that are declared but not
    # present in the data. Make sure those counts are excluded from the
    # most common outputs
    extra_dtype = CategoricalDtype(categories=[1, 2, 3])
    cats_extra = pd.Series([1, 2, 1, 1]).astype(extra_dtype)
    assert pd.Series(n_most(cats_extra)).equals(expected)


def test_agg_primitives_can_init_without_params():
    agg_primitives = get_aggregation_primitives().values()
    for agg_primitive in agg_primitives:
        agg_primitive()


def test_trend_works_with_different_input_dtypes():
    dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])
    numeric = pd.Series([1, 2, 3])

    trend = Trend()
    dtypes = ["float64", "int64", "Int64"]

    for dtype in dtypes:
        actual = trend(numeric.astype(dtype), dates)
        assert np.isclose(actual, 1)


def test_percent_true_boolean():
    booleans = pd.Series([True, False, True, pd.NA], dtype="boolean")
    pct_true = PercentTrue()
    assert pct_true(booleans) == 0.5


class TestAverageCountPerUnique(PrimitiveTestBase):
    primitive = AverageCountPerUnique
    array = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 7, 8])

    def test_average_count_per_unique(self):
        primitive_func = AverageCountPerUnique().get_function()
        assert primitive_func(self.array) == 1.25

    def test_nans(self):
        primitive_func = AverageCountPerUnique().get_function()
        array_nans = pd.concat([self.array.copy(), pd.Series([np.nan])])
        assert primitive_func(array_nans) == 1.25
        # With skipna=False, NaN counts as its own unique value: 11 rows / 9 uniques
        primitive_func = AverageCountPerUnique(skipna=False).get_function()
        assert primitive_func(array_nans) == (11 / 9.0)

    def test_empty_string(self):
        primitive_func = AverageCountPerUnique().get_function()
        array_empty_string = pd.concat([self.array.copy(), pd.Series([np.nan, "", ""])])
        assert primitive_func(array_empty_string) == (4 / 3.0)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestVariance(PrimitiveTestBase):
    primitive = Variance

    def test_regular(self):
        variance = self.primitive().get_function()
        np.testing.assert_almost_equal(variance(np.array([0, 3, 4, 3])), 2.25)

    def test_single(self):
        variance = self.primitive().get_function()
        np.testing.assert_almost_equal(variance(np.array([4])), 0)

    def test_double(self):
        variance = self.primitive().get_function()
        np.testing.assert_almost_equal(variance(np.array([3, 4])), 0.25)

    def test_empty(self):
        variance = self.primitive().get_function()
        np.testing.assert_almost_equal(variance(np.array([])), np.nan)

    def test_nan(self):
        variance = self.primitive().get_function()
        np.testing.assert_almost_equal(
            variance(pd.Series([0, np.nan, 4, 3])),
            2.8888888888888893,
        )

    def test_allnan(self):
        variance = self.primitive().get_function()
        np.testing.assert_almost_equal(
            variance(pd.Series([np.nan, np.nan, np.nan])),
            np.nan,
        )


class TestFirstLastTimeDelta(PrimitiveTestBase):
    primitive = FirstLastTimeDelta
    times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])
    actual_delta = (times.iloc[-1] - times.iloc[0]).total_seconds()

    def test_first_last_time_delta(self):
        primitive_func = self.primitive().get_function()
        assert primitive_func(self.times) == self.actual_delta

    def test_with_nans(self):
        primitive_func = self.primitive().get_function()
        times = pd.concat([self.times, pd.Series([np.nan])])
        assert primitive_func(times) == self.actual_delta
        assert pd.isna(primitive_func(pd.Series([np.nan])))

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestEntropy(PrimitiveTestBase):
    primitive = Entropy

    @pytest.mark.parametrize(
        "dtype",
        ["category", "object", "string"],
    )
    def test_regular(self, dtype):
        data = pd.Series([1, 2, 3, 2], dtype=dtype)
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert np.isclose(given_answer, 1.03, atol=0.01)

    @pytest.mark.parametrize(
        "dtype",
        ["category", "object", "string"],
    )
    def test_empty(self, dtype):
        data = pd.Series([], dtype=dtype)
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert given_answer == 0.0

    @pytest.mark.parametrize(
        "dtype",
        ["category", "object", "string"],
    )
    def test_args(self, dtype):
        data = pd.Series([1, 2, 3, 2], dtype=dtype)
        if dtype == "string":
            data = pd.concat([data, pd.Series([pd.NA, pd.NA], dtype=dtype)])
        else:
            data = pd.concat([data, pd.Series([np.nan, np.nan], dtype=dtype)])
        primitive_func = self.primitive(dropna=True, base=2).get_function()
        given_answer = primitive_func(data)
        assert np.isclose(given_answer, 1.5, atol=0.001)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive, max_depth=2)
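

# Illustrative sketch only (not part of featuretools): a plain-Python Shannon
# entropy, assuming the Entropy primitive follows the usual definition
# -sum(p * log(p)) over value frequencies. Under that assumption it reproduces
# the values asserted in TestEntropy: roughly 1.04 nats for [1, 2, 3, 2] and
# 1.5 bits with base=2. The helper name is hypothetical.
def _reference_shannon_entropy(values, base=None):
    from collections import Counter
    from math import log

    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        # No observations: treat the entropy as 0.0, as test_empty expects.
        return 0.0
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    # Convert from nats to the requested base, if one was given.
    return entropy / log(base) if base is not None else entropy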


class TestKurtosis(PrimitiveTestBase):
    primitive = Kurtosis

    @pytest.mark.parametrize(
        "dtype",
        ["int64", "float64"],
    )
    def test_regular(self, dtype):
        data = pd.Series([1, 2, 3, 4, 5], dtype=dtype)
        answer = -1.3
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert np.isclose(answer, given_answer, atol=0.01)

        data = pd.Series([1, 2, 3, 4, 5, 6], dtype=dtype)
        answer = -1.26
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert np.isclose(answer, given_answer, atol=0.01)

        data = pd.Series([x * x for x in list(range(100))], dtype=dtype)
        answer = -0.85
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert np.isclose(answer, given_answer, atol=0.01)

        if dtype == "float64":
            # Series contains floating point values - only check with float dtype
            data = pd.Series([sqrt(x) for x in list(range(100))], dtype=dtype)
            answer = -0.46
            primitive_func = self.primitive().get_function()
            given_answer = primitive_func(data)
            assert np.isclose(answer, given_answer, atol=0.01)

    def test_nan(self):
        data = pd.Series([np.nan, 5, 3], dtype="float64")
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert pd.isna(given_answer)

    @pytest.mark.parametrize(
        "dtype",
        ["int64", "float64"],
    )
    def test_empty(self, dtype):
        data = pd.Series([], dtype=dtype)
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert pd.isna(given_answer)

    def test_inf(self):
        data = pd.Series([1, np.inf], dtype="float64")
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert pd.isna(given_answer)

        # np.NINF was removed in NumPy 2.0; -np.inf is the equivalent value.
        data = pd.Series([-np.inf, 1, np.inf], dtype="float64")
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert pd.isna(given_answer)

    def test_arg(self):
        data = pd.Series([1, 2, 3, 4, 5, np.nan, np.nan], dtype="float64")
        answer = -1.3
        primitive_func = self.primitive(nan_policy="omit").get_function()
        given_answer = primitive_func(data)
        # Compare with a tolerance rather than exact floating-point equality.
        assert np.isclose(answer, given_answer)

        primitive_func = self.primitive(nan_policy="propagate").get_function()
        given_answer = primitive_func(data)
        assert np.isnan(given_answer)

        primitive_func = self.primitive(nan_policy="raise").get_function()
        with raises(ValueError):
            primitive_func(data)

    def test_error(self):
        with raises(ValueError):
            self.primitive(nan_policy="invalid_policy").get_function()

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)
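

# Illustrative sketch only (not part of featuretools): the biased excess
# (Fisher) kurtosis, m4 / m2**2 - 3, where m2 and m4 are the second and fourth
# central moments. This assumes the Kurtosis primitive uses that definition,
# which is consistent with the expected values in TestKurtosis (-1.3 for
# [1, 2, 3, 4, 5]). The helper name is hypothetical.
def _reference_excess_kurtosis(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2**2 - 3.0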


class TestNumZeroCrossings(PrimitiveTestBase):
    primitive = NumZeroCrossings

    def test_nan(self):
        data = pd.Series([3, np.nan, 5, 3, np.nan, 0, np.nan, 0, np.nan, -2])
        # crossing from 0 to np.nan to -2, which is 1 crossing
        answer = 1
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert given_answer == answer

    def test_empty(self):
        data = pd.Series([], dtype="int64")
        answer = 0
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert given_answer == answer

    def test_inf(self):
        data = pd.Series([-1, np.inf])
        answer = 1
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert given_answer == answer

        # np.NINF was removed in NumPy 2.0; -np.inf is the equivalent value.
        data = pd.Series([-np.inf, 1, np.inf])
        answer = 1
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert given_answer == answer

    def test_zeros(self):
        data = pd.Series([1, 0, -1, 0, 1, 0, -1])
        answer = 3
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert given_answer == answer

        data = pd.Series([1, 0, 1, 0, 1])
        answer = 0
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert given_answer == answer

    def test_regular(self):
        data = pd.Series([1, 2, 3, 4, 5])
        answer = 0
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert given_answer == answer

        data = pd.Series([1, -1, 2, -2, 3, -3])
        answer = 5
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        assert given_answer == answer

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)
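

# Illustrative sketch only (not part of featuretools): counts sign changes
# after skipping NaNs and zeros. Skipping both is an assumption, chosen
# because it matches the expectations in TestNumZeroCrossings (for example,
# [1, 0, -1, 0, 1, 0, -1] gives 3 crossings). The helper name is hypothetical.
def _reference_num_zero_crossings(values):
    from math import isnan

    # Keep only the values that have a definite sign.
    signed = [v for v in values if not isnan(v) and v != 0]
    # A crossing is any adjacent pair whose signs differ.
    return sum(1 for a, b in zip(signed, signed[1:]) if (a > 0) != (b > 0))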


class TestNumTrueSinceLastFalse(PrimitiveTestBase):
    primitive = NumTrueSinceLastFalse

    def test_regular(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([False, True, False, True, True])
        answer = primitive_func(bools)
        correct_answer = 2
        assert answer == correct_answer

    def test_regular_end_in_false(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([False, True, False, True, True, False])
        answer = primitive_func(bools)
        correct_answer = 0
        assert answer == correct_answer

    def test_no_false(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([True] * 5)
        assert pd.isna(primitive_func(bools))

    def test_all_false(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([False, False, False])
        answer = primitive_func(bools)
        correct_answer = 0
        assert answer == correct_answer

    def test_nan(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([False, True, np.nan, True, True])
        answer = primitive_func(bools)
        correct_answer = 3
        assert answer == correct_answer

    def test_all_nan(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([np.nan, np.nan, np.nan])
        assert pd.isna(primitive_func(bools))

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestNumFalseSinceLastTrue(PrimitiveTestBase):
    primitive = NumFalseSinceLastTrue

    def test_regular(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([True, False, True, False, False])
        answer = primitive_func(bools)
        correct_answer = 2
        assert answer == correct_answer

    def test_regular_end_in_true(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([True, False, True, False, False, True])
        answer = primitive_func(bools)
        correct_answer = 0
        assert answer == correct_answer

    def test_no_true(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([False] * 5)
        assert pd.isna(primitive_func(bools))

    def test_all_true(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([True, True, True])
        answer = primitive_func(bools)
        correct_answer = 0
        assert answer == correct_answer

    def test_nan(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([True, False, np.nan, False, False])
        answer = primitive_func(bools)
        correct_answer = 3
        assert answer == correct_answer

    def test_all_nan(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([np.nan, np.nan, np.nan])
        assert pd.isna(primitive_func(bools))

    def test_numeric_and_string_input(self):
        primitive_func = self.primitive().get_function()
        bools = pd.Series([True, 0, 1, "10", ""])
        answer = primitive_func(bools)
        correct_answer = 1
        assert answer == correct_answer

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestNumPeaks(PrimitiveTestBase):
    primitive = NumPeaks

    @pytest.mark.parametrize(
        "dtype",
        ["int64", "float64", "Int64"],
    )
    def test_negative_and_positive_nums(self, dtype):
        get_peaks = self.primitive().get_function()
        assert (
            get_peaks(pd.Series([-5, 0, 10, 0, 10, -5, -4, -5, 10, 0], dtype=dtype))
            == 4
        )

    @pytest.mark.parametrize(
        "dtype",
        ["int64", "float64", "Int64"],
    )
    def test_plateau(self, dtype):
        get_peaks = self.primitive().get_function()
        assert get_peaks(pd.Series([1, 2, 3, 3, 3, 3, 3, 2, 1], dtype=dtype)) == 1
        assert get_peaks(pd.Series([1, 2, 3, 3, 3, 4, 3, 3, 3, 2, 1], dtype=dtype)) == 1
        assert (
            get_peaks(
                pd.Series(
                    [5, 4, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 3, 3, 3, 3, 4],
                    dtype=dtype,
                ),
            )
            == 1
        )
        assert (
            get_peaks(
                pd.Series(
                    [1, 2, 3, 3, 3, 2, 1, 2, 3, 3, 3, 2, 5, 5, 5, 2],
                    dtype=dtype,
                ),
            )
            == 3
        )

    @pytest.mark.parametrize(
        "dtype",
        ["int64", "float64", "Int64"],
    )
    def test_regular(self, dtype):
        get_peaks = self.primitive().get_function()
        assert get_peaks(pd.Series([1, 7, 3, 8, 2, 3, 4, 3, 4, 2, 4], dtype=dtype)) == 4
        assert get_peaks(pd.Series([1, 2, 3, 2, 1], dtype=dtype)) == 1

    @pytest.mark.parametrize(
        "dtype",
        ["int64", "float64", "Int64"],
    )
    def test_no_peak(self, dtype):
        get_peaks = self.primitive().get_function()
        assert get_peaks(pd.Series([1, 2, 3], dtype=dtype)) == 0
        assert get_peaks(pd.Series([3, 2, 2, 2, 2, 1], dtype=dtype)) == 0

    @pytest.mark.parametrize(
        "dtype",
        ["int64", "float64", "Int64"],
    )
    def test_too_small_data(self, dtype):
        get_peaks = self.primitive().get_function()
        assert get_peaks(pd.Series([], dtype=dtype)) == 0
        assert get_peaks(pd.Series([1])) == 0
        assert get_peaks(pd.Series([1, 1])) == 0
        assert get_peaks(pd.Series([1, 2])) == 0
        assert get_peaks(pd.Series([2, 1])) == 0

    @pytest.mark.parametrize(
        "dtype",
        ["int64", "float64", "Int64"],
    )
    def test_nans(self, dtype):
        get_peaks = self.primitive().get_function()
        array = pd.Series(
            [0, 5, 10, 15, 20, 0, 1, 2, 3, 0, 0, 5, 0, 7, 14],
            dtype=dtype,
        )
        if dtype == "float64":
            array = pd.concat([array, pd.Series([np.nan, np.nan])])
        elif dtype == "Int64":
            array = pd.concat([array, pd.Series([pd.NA, pd.NA])])
        array = array.astype(dtype=dtype)
        assert get_peaks(array) == 3

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)
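

# Illustrative sketch only (not part of featuretools): counts strict local
# maxima, treating a flat plateau as a single candidate peak and never
# counting the first or last position. These rules are an assumption inferred
# from the expectations in TestNumPeaks, not the primitive's actual
# implementation. The helper name is hypothetical.
def _reference_count_peaks(values):
    # Collapse runs of equal values so that a plateau is considered once.
    collapsed = []
    for v in values:
        if not collapsed or v != collapsed[-1]:
            collapsed.append(v)
    # A peak is an interior value strictly greater than both neighbours.
    return sum(
        1
        for i in range(1, len(collapsed) - 1)
        if collapsed[i - 1] < collapsed[i] > collapsed[i + 1]
    )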


class TestDateFirstEvent(PrimitiveTestBase):
    primitive = DateFirstEvent

    def test_regular(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series(
            [
                "2011-04-09 10:30:00",
                "2011-04-09 10:30:06",
                "2011-04-09 10:30:12",
                "2011-04-09 10:30:18",
            ],
            dtype="datetime64[ns]",
        )
        answer = pd.Timestamp("2011-04-09 10:30:00")
        given_answer = primitive_func(case)
        assert given_answer == answer

    def test_nat(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series(
            [
                pd.NaT,
                pd.NaT,
                "2011-04-09 10:30:12",
                "2011-04-09 10:30:18",
            ],
            dtype="datetime64[ns]",
        )
        answer = pd.Timestamp("2011-04-09 10:30:12")
        given_answer = primitive_func(case)
        assert given_answer == answer

    def test_empty(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([], dtype="datetime64[ns]")
        given_answer = primitive_func(case)
        assert pd.isna(given_answer)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)

    def test_serialize(self, es):
        check_serialize(self.primitive, es, target_dataframe_name="sessions")


class TestMinCount(PrimitiveTestBase):
    primitive = MinCount

    def test_nan(self):
        data = pd.Series([np.nan, np.nan, np.nan])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert pd.isna(answer)

    def test_inf(self):
        data = pd.Series([5, 10, 10, np.inf, np.inf, np.inf])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert answer == 1

    def test_regular(self):
        data = pd.Series([1, 2, 2, 2, 3, 4, 4, 4, 5])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert answer == 1

        data = pd.Series([2, 2, 2, 3, 4, 4, 4])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert answer == 3

    def test_skipna(self):
        data = pd.Series([1, 1, 2, 3, 4, 4, np.nan, 5])
        primitive_func = self.primitive(skipna=False).get_function()
        answer = primitive_func(data)
        assert pd.isna(answer)

    def test_ninf(self):
        # np.NINF was removed in NumPy 2.0; -np.inf is the equivalent value.
        data = pd.Series([-np.inf, -np.inf, np.nan])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert answer == 2

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestMaxCount(PrimitiveTestBase):
    primitive = MaxCount

    def test_nan(self):
        data = pd.Series([np.nan, np.nan, np.nan])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert pd.isna(answer)

    def test_inf(self):
        data = pd.Series([5, 10, 10, np.inf, np.inf, np.inf])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert answer == 3

    def test_regular(self):
        data = pd.Series([1, 1, 2, 3, 4, 4, 4, 5])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert answer == 1

        data = pd.Series([1, 1, 2, 3, 4, 4, 4])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert answer == 3

    def test_skipna(self):
        data = pd.Series([1, 1, 2, 3, 4, 4, np.nan, 5])
        primitive_func = self.primitive(skipna=False).get_function()
        answer = primitive_func(data)
        assert pd.isna(answer)

    def test_ninf(self):
        # np.NINF was removed in NumPy 2.0; -np.inf is the equivalent value.
        data = pd.Series([-np.inf, -np.inf, np.nan])
        primitive_func = self.primitive().get_function()
        answer = primitive_func(data)
        assert answer == 2

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestMaxMinDelta(PrimitiveTestBase):
    primitive = MaxMinDelta
    array = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 7, 8])

    def test_max_min_delta(self):
        primitive_func = self.primitive().get_function()
        assert primitive_func(self.array) == 7.0

    def test_nans(self):
        primitive_func = self.primitive().get_function()
        array_nans = pd.concat([self.array, pd.Series([np.nan])])
        assert primitive_func(array_nans) == 7.0
        primitive_func = self.primitive(skipna=False).get_function()
        array_nans = pd.concat([self.array, pd.Series([np.nan])])
        assert pd.isna(primitive_func(array_nans))

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestMedianCount(PrimitiveTestBase):
    primitive = MedianCount

    def test_regular(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([1, 3, 5, 7])
        given_answer = primitive_func(case)
        assert given_answer == 0

    def test_nans(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([1, 3, 4, 4, 4, 5, 7, np.nan, np.nan])
        given_answer = primitive_func(case)
        assert given_answer == 3
        primitive_func = self.primitive(skipna=False).get_function()
        given_answer = primitive_func(case)
        assert pd.isna(given_answer)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestNMostCommonFrequency(PrimitiveTestBase):
    primitive = NMostCommonFrequency

    def test_regular(self):
        test_cases = [
            pd.Series([8, 7, 10, 10, 10, 3, 4, 5, 10, 8, 7]),
            pd.Series([7, 7, 7, 6, 6, 5, 4]),
            pd.Series([4, 5, 6, 6, 7, 7, 7]),
        ]

        answers = [
            pd.Series([4, 2, 2]),
            pd.Series([3, 2, 1]),
            pd.Series([3, 2, 1]),
        ]

        primitive_func = self.primitive(3).get_function()

        for case, answer in zip(test_cases, answers):
            given_answer = primitive_func(case)
            given_answer = given_answer.reset_index(drop=True)
            assert given_answer.equals(answer)

    def test_n_larger_than_len(self):
        test_cases = [
            pd.Series(["red", "red", "blue", "green"]),
            pd.Series(["red", "red", "red", "blue", "green"]),
            pd.Series(["red", "blue", "green", "orange"]),
        ]
        answers = [
            pd.Series([2, 1, 1, np.nan, np.nan]),
            pd.Series([3, 1, 1, np.nan, np.nan]),
            pd.Series([1, 1, 1, 1, np.nan]),
        ]

        primitive_func = self.primitive(5).get_function()
        for case, answer in zip(test_cases, answers):
            given_answer = primitive_func(case)
            given_answer = given_answer.reset_index(drop=True)
            assert given_answer.equals(answer)

    def test_skipna(self):
        array = pd.Series(["red", "red", "blue", "green", np.nan, np.nan])
        primitive_func = self.primitive(5, skipna=False).get_function()
        given_answer = primitive_func(array)
        given_answer = given_answer.reset_index(drop=True)
        answer = pd.Series([2, 2, 1, 1, np.nan])
        assert given_answer.equals(answer)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        aggregation.append(self.primitive(5))
        valid_dfs(
            es,
            aggregation,
            transform,
            self.primitive,
            target_dataframe_name="customers",
            multi_output=True,
        )

    def test_with_featuretools_args(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        aggregation.append(self.primitive(5, skipna=False))
        valid_dfs(
            es,
            aggregation,
            transform,
            self.primitive,
            target_dataframe_name="customers",
            multi_output=True,
        )

    def test_serialize(self, es):
        check_serialize(
            primitive=self.primitive,
            es=es,
            target_dataframe_name="customers",
        )


class TestNUniqueDays(PrimitiveTestBase):
    primitive = NUniqueDays

    def test_two_years(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2011-12-31"))
        assert primitive_func(array) == 365 * 2

    def test_leap_year(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2016-01-01", "2017-12-31"))
        assert primitive_func(array) == 365 * 2 + 1

    def test_ten_years(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2019-12-31"))
        assert primitive_func(array) == 365 * 10 + 1 + 1

    def test_distinct_dt(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                datetime(2019, 2, 21),
                datetime(2019, 2, 1, 1, 20, 0),
                datetime(2019, 2, 1, 1, 30, 0),
                datetime(2018, 2, 1),
                datetime(2019, 1, 1),
            ],
        )
        assert primitive_func(array) == 4

    def test_NaT(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2011-12-31"))
        NaT_array = pd.Series([pd.NaT] * 100)
        assert primitive_func(pd.concat([array, NaT_array])) == 365 * 2

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestNUniqueDaysOfCalendarYear(PrimitiveTestBase):
    primitive = NUniqueDaysOfCalendarYear

    def test_two_years(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2011-12-31"))
        assert primitive_func(array) == 365

    def test_leap_year(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2016-01-01", "2017-12-31"))
        assert primitive_func(array) == 366

    def test_ten_years(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2019-12-31"))
        assert primitive_func(array) == 366

    def test_distinct_dt(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                datetime(2019, 2, 21),
                datetime(2019, 2, 1, 1, 20, 0),
                datetime(2019, 2, 1, 1, 30, 0),
                datetime(2018, 2, 1),
                datetime(2019, 1, 1),
            ],
        )
        assert primitive_func(array) == 3

    def test_NaT(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2011-12-31"))
        NaT_array = pd.Series([pd.NaT] * 100)
        assert primitive_func(pd.concat([array, NaT_array])) == 365

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestNUniqueDaysOfMonth(PrimitiveTestBase):
    primitive = NUniqueDaysOfMonth

    def test_two_days(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2010-01-02"))
        assert primitive_func(array) == 2

    def test_one_year(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2010-12-31"))
        assert primitive_func(array) == 31

    def test_leap_year(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2016-01-01", "2017-12-31"))
        assert primitive_func(array) == 31

    def test_distinct_dt(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                datetime(2019, 2, 21),
                datetime(2019, 2, 1, 1, 20, 0),
                datetime(2019, 2, 1, 1, 30, 0),
                datetime(2018, 2, 1),
                datetime(2019, 1, 1),
            ],
        )
        assert primitive_func(array) == 2

    def test_NaT(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2010-12-31"))
        NaT_array = pd.Series([pd.NaT] * 100)
        assert primitive_func(pd.concat([array, NaT_array])) == 31

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestNUniqueMonths(PrimitiveTestBase):
    primitive = NUniqueMonths

    def test_two_days(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2010-01-02"))
        assert primitive_func(array) == 1

    def test_ten_years(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2019-12-31"))
        assert primitive_func(array) == 12 * 10

    def test_distinct_dt(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                datetime(2019, 2, 21),
                datetime(2019, 2, 1, 1, 20, 0),
                datetime(2019, 2, 1, 1, 30, 0),
                datetime(2018, 2, 1),
                datetime(2019, 1, 1),
            ],
        )
        assert primitive_func(array) == 3

    def test_NaT(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2011-12-31"))
        NaT_array = pd.Series([pd.NaT] * 100)
        assert primitive_func(pd.concat([array, NaT_array])) == 12 * 2

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestNUniqueWeeks(PrimitiveTestBase):
    primitive = NUniqueWeeks

    def test_same_week(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2019-01-01", "2019-01-02"))
        assert primitive_func(array) == 1

    def test_ten_years(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2010-01-01", "2019-12-31"))
        assert primitive_func(array) == 523

    def test_distinct_dt(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                datetime(2019, 2, 21),
                datetime(2019, 2, 1, 1, 20, 0),
                datetime(2019, 2, 1, 1, 30, 0),
                datetime(2018, 2, 2),
                datetime(2019, 2, 3, 1, 30, 0),
                datetime(2019, 1, 1),
            ],
        )
        assert primitive_func(array) == 4

    def test_NaT(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(pd.date_range("2019-01-01", "2019-01-02"))
        NaT_array = pd.Series([pd.NaT] * 100)
        assert primitive_func(pd.concat([array, NaT_array])) == 1

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        aggregation.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestHasNoDuplicates(PrimitiveTestBase):
    primitive = HasNoDuplicates

    def test_regular(self):
        primitive_func = self.primitive().get_function()
        data = pd.Series([1, 1, 2])
        assert not primitive_func(data)
        assert isinstance(primitive_func(data), bool)

        data = pd.Series([1, 2, 3])
        assert primitive_func(data)
        assert isinstance(primitive_func(data), bool)

        data = pd.Series([1, 2, 4])
        assert primitive_func(data)
        assert isinstance(primitive_func(data), bool)

        data = pd.Series(["red", "blue", "orange"])
        assert primitive_func(data)
        assert isinstance(primitive_func(data), bool)

        data = pd.Series(["red", "blue", "red"])
        assert not primitive_func(data)

    def test_nan(self):
        primitive_func = self.primitive().get_function()
        data = pd.Series([np.nan, 1, 2, 3])
        assert primitive_func(data)
        assert isinstance(primitive_func(data), bool)

        data = pd.Series([np.nan, np.nan, 1])
        # drop both nans, so has 1 value
        assert primitive_func(data) is True
        assert isinstance(primitive_func(data), bool)

        primitive_func = self.primitive(skipna=False).get_function()
        data = pd.Series([np.nan, np.nan, 1])
        assert primitive_func(data) is False
        assert isinstance(primitive_func(data), bool)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instantiate = self.primitive()
        aggregation.append(primitive_instantiate)
        valid_dfs(
            es,
            aggregation,
            transform,
            self.primitive,
            target_dataframe_name="customers",
            instance_ids=[0, 1, 2],
        )


class TestIsMonotonicallyDecreasing(PrimitiveTestBase):
    primitive = IsMonotonicallyDecreasing

    def test_monotonically_decreasing(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([9, 5, 3, 1, -1])
        assert primitive_func(case) is True

    def test_monotonically_increasing(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([-1, 1, 3, 5, 9])
        assert primitive_func(case) is False

    def test_non_monotonic(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([-1, 1, 3, 2, 5])
        assert primitive_func(case) is False

    def test_weakly_decreasing(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([9, 3, 3, 1, -1])
        assert primitive_func(case) is True

    def test_nan(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([9, 5, 3, np.nan, 1, -1])
        assert primitive_func(case) is True

        primitive_func = self.primitive().get_function()
        case = pd.Series([-1, 1, 3, np.nan, 5, 9])
        assert primitive_func(case) is False

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instantiate = self.primitive()
        aggregation.append(primitive_instantiate)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestIsMonotonicallyIncreasing(PrimitiveTestBase):
    primitive = IsMonotonicallyIncreasing

    def test_monotonically_increasing(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([-1, 1, 3, 5, 9])
        assert primitive_func(case) is True

    def test_monotonically_decreasing(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([9, 5, 3, 1, -1])
        assert primitive_func(case) is False

    def test_non_monotonic(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([-1, 1, 3, 2, 5])
        assert primitive_func(case) is False

    def test_weakly_increasing(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([-1, 1, 3, 3, 9])
        assert primitive_func(case) is True

    def test_nan(self):
        primitive_func = self.primitive().get_function()
        case = pd.Series([-1, 1, 3, np.nan, 5, 9])
        assert primitive_func(case) is True

        primitive_func = self.primitive().get_function()
        case = pd.Series([9, 5, 3, np.nan, 1, -1])
        assert primitive_func(case) is False

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instantiate = self.primitive()
        aggregation.append(primitive_instantiate)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_count_aggregation_primitives.py
================================================
import numpy as np
import pandas as pd
from pytest import raises

from featuretools.primitives import (
    CountAboveMean,
    CountGreaterThan,
    CountInsideNthSTD,
    CountInsideRange,
    CountLessThan,
    CountOutsideNthSTD,
    CountOutsideRange,
)
from featuretools.tests.primitive_tests.utils import PrimitiveTestBase


class TestCountAboveMean(PrimitiveTestBase):
    primitive = CountAboveMean

    def test_regular(self):
        data = pd.Series([1, 2, 3, 4, 5])
        expected = 2
        primitive_func = self.primitive().get_function()
        actual = primitive_func(data)
        assert expected == actual

        data = pd.Series([1, 2, 3.1, 4, 5])
        expected = 3
        primitive_func = self.primitive().get_function()
        actual = primitive_func(data)
        assert expected == actual

    def test_nan_without_ignore_nan(self):
        data = pd.Series([np.nan, 1, 2, 3, 4, 5, np.nan, np.nan])
        expected = np.nan

        primitive_func = self.primitive(skipna=False).get_function()
        actual = primitive_func(data)
        assert np.isnan(actual) == np.isnan(expected)

        data = pd.Series([np.nan])
        primitive_func = self.primitive(skipna=False).get_function()
        actual = primitive_func(data)
        assert np.isnan(actual) == np.isnan(expected)

    def test_nan_with_ignore_nan(self):
        data = pd.Series([np.nan, 1, 2, 3, 4, 5, np.nan, np.nan])
        expected = 2
        primitive_func = self.primitive(skipna=True).get_function()
        actual = primitive_func(data)
        assert expected == actual

        data = pd.Series([np.nan, 1, 2, 3.1, 4, 5, np.nan, np.nan])
        expected = 3
        primitive_func = self.primitive(skipna=True).get_function()
        actual = primitive_func(data)
        assert expected == actual

        data = pd.Series([np.nan])
        expected = np.nan
        primitive_func = self.primitive(skipna=True).get_function()
        actual = primitive_func(data)
        assert np.isnan(actual) == np.isnan(expected)

    def test_inf(self):
        data = pd.Series([-np.inf, 1, 2, 3, 4, 5])
        expected = 5
        primitive_func = self.primitive().get_function()
        actual = primitive_func(data)
        assert expected == actual

        data = pd.Series([1, 2, 3, 4, 5, np.inf])
        expected = 0
        primitive_func = self.primitive().get_function()
        actual = primitive_func(data)
        assert expected == actual

        data = pd.Series([-np.inf, 1, 2, 3, 4, 5, np.inf])
        expected = np.nan
        primitive_func = self.primitive().get_function()
        actual = primitive_func(data)
        assert np.isnan(actual) == np.isnan(expected)

        primitive_func = self.primitive(skipna=False).get_function()
        actual = primitive_func(data)
        assert np.isnan(actual) == np.isnan(expected)


class TestCountGreaterThan(PrimitiveTestBase):
    primitive = CountGreaterThan

    def compare_results(self, data, thresholds, results):
        for threshold, result in zip(thresholds, results):
            primitive = self.primitive(threshold=threshold)
            function = primitive.get_function()
            assert function(data) == result
            assert isinstance(function(data), np.int64)

    def test_regular(self):
        data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])
        thresholds = pd.Series([-5, -2, 0, 2, 5])
        results = pd.Series([10, 7, 5, 3, 0])
        self.compare_results(data, thresholds, results)

    def test_edges(self):
        data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])
        thresholds = pd.Series([np.inf, -np.inf, None, np.nan])
        results = pd.Series([0, len(data), 0, 0])
        self.compare_results(data, thresholds, results)

    def test_nans(self):
        data = pd.Series([-5, -4, -3, np.inf, -np.inf, np.nan, 1, 2, 3, 4, 5])
        thresholds = pd.Series([np.inf, -np.inf, None, 0, np.nan])
        results = pd.Series([0, 9, 0, 6, 0])
        self.compare_results(data, thresholds, results)


class TestCountInsideNthSTD:
    primitive = CountInsideNthSTD

    def test_normal_distribution(self):
        x = pd.Series(
            [
                -76.0,
                41.0,
                -43.0,
                -152.0,
                -89.0,
                28.0,
                49.0,
                298.0,
                -132.0,
                146.0,
                -107.0,
                -26.0,
                26.0,
                -81.0,
                116.0,
                -217.0,
                -102.0,
                144.0,
                120.0,
                -130.0,
            ],
        )

        first_outliers = [-152.0, 298.0, 146.0, 116.0, -217.0, 144.0, 120.0]
        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(x) - len(first_outliers)

        second_outliers = [298.0]
        primitive_instance = self.primitive(2)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(x) - len(second_outliers)

    def test_poisson_distribution(self):
        x = pd.Series(
            [
                1,
                1,
                3,
                3,
                0,
                0,
                1,
                3,
                3,
                1,
                2,
                3,
                2,
                0,
                1,
                3,
                2,
                1,
                0,
                2,
            ],
        )

        first_outliers = [3, 3, 0, 0, 3, 3, 3, 0, 3, 0]
        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(x) - len(first_outliers)

        second_outliers = []
        primitive_instance = self.primitive(2)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(x) - len(second_outliers)

    def test_nan(self):
        # test if function ignores nan values
        x = pd.Series(
            [
                -76.0,
                41.0,
                -43.0,
                -152.0,
                -89.0,
                28.0,
                49.0,
                298.0,
                -132.0,
                146.0,
                -107.0,
                -26.0,
                26.0,
                -81.0,
                116.0,
                -217.0,
                -102.0,
                144.0,
                120.0,
                -130.0,
            ],
        )
        x = pd.concat([x, pd.Series([np.nan] * 20)])
        first_outliers = [-152.0, 298.0, 146.0, 116.0, -217.0, 144.0, 120.0]
        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(x) - len(first_outliers) - 20

        # test a series with all nan values
        x = pd.Series([np.nan] * 20)

        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 0

    def test_negative_n(self):
        with raises(ValueError):
            self.primitive(-1)


class TestCountInsideRange(PrimitiveTestBase):
    primitive = CountInsideRange

    def test_integer_range(self):
        # all integers from -100 to 100
        x = pd.Series(np.arange(-100, 101, 1))
        primitive_instance = self.primitive(-100, 100)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 201

        primitive_instance = self.primitive(-50, 50)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 101

        primitive_instance = self.primitive(1, 1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 1

    def test_float_range(self):
        x = pd.Series(np.linspace(-3, 3, 10))

        primitive_instance = self.primitive(-3, 3)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 10

        primitive_instance = self.primitive(-0.34, 1.68)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 4

        primitive_instance = self.primitive(-3, -3)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 1

    def test_nan(self):
        x = pd.Series(np.linspace(-3, 3, 10))
        x = pd.concat([x, pd.Series([np.nan] * 20)])

        primitive_instance = self.primitive(-0.34, 1.68)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 4

        primitive_instance = self.primitive(-3, 3, False)
        primitive_func = primitive_instance.get_function()
        assert np.isnan(primitive_func(x))

    def test_inf(self):
        x = pd.Series(np.linspace(-3, 3, 10))
        num_NINF = 20
        x = pd.concat([x, pd.Series([-np.inf] * num_NINF)])
        num_inf = 10
        x = pd.concat([x, pd.Series([np.inf] * num_inf)])

        primitive_instance = self.primitive(-3, 3)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 10

        primitive_instance = self.primitive(-np.inf, 3)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 10 + num_NINF

        primitive_instance = self.primitive(-3, np.inf)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 10 + num_inf


class TestCountLessThan(PrimitiveTestBase):
    primitive = CountLessThan

    def compare_answers(self, data, thresholds, answers):
        for threshold, answer in zip(thresholds, answers):
            primitive = self.primitive(threshold=threshold)
            function = primitive.get_function()
            assert function(data) == answer
            assert isinstance(function(data), np.int64)

    def test_regular(self):
        data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])
        thresholds = pd.Series([-5, -2, 0, 2, 5])
        answers = pd.Series([0, 3, 5, 7, 10])
        self.compare_answers(data, thresholds, answers)

    def test_edges(self):
        data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])
        thresholds = pd.Series([np.inf, -np.inf, None, np.nan])
        answers = pd.Series([len(data), 0, 0, 0])
        self.compare_answers(data, thresholds, answers)

    def test_nans(self):
        data = pd.Series([-5, -4, -3, np.inf, -np.inf, np.nan, 1, 2, 3, 4, 5])
        thresholds = pd.Series([np.inf, -np.inf, None, 0, np.nan])
        answers = pd.Series([9, 0, 0, 4, 0])
        self.compare_answers(data, thresholds, answers)


class TestCountOutsideNthSTD(PrimitiveTestBase):
    primitive = CountOutsideNthSTD

    def test_normal_distribution(self):
        x = pd.Series(
            [
                10,
                386,
                479,
                627,
                20,
                523,
                482,
                483,
                542,
                699,
                535,
                617,
                577,
                471,
                615,
                583,
                441,
                562,
                563,
                527,
                453,
                530,
                433,
                541,
                585,
                704,
                443,
                569,
                430,
                637,
                331,
                511,
                552,
                496,
                484,
                566,
                554,
                472,
                335,
                440,
                579,
                341,
                545,
                615,
                548,
                604,
                439,
                556,
                442,
                461,
                624,
                611,
                444,
                578,
                405,
                487,
                490,
                496,
                398,
                512,
                422,
                455,
                449,
                432,
                607,
                679,
                434,
                597,
                639,
                565,
                415,
                486,
                668,
                414,
                665,
                763,
                557,
                304,
                404,
                454,
                689,
                610,
                483,
                441,
                657,
                590,
                492,
                476,
                437,
                483,
                529,
                363,
                711,
                543,
            ],
        )
        outliers = [10, 20, 763]
        primitive_instance = self.primitive(2)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(outliers)

    def test_poisson_distribution(self):
        x = pd.Series(
            [
                1,
                1,
                3,
                3,
                0,
                0,
                1,
                3,
                3,
                1,
                2,
                3,
                2,
                0,
                1,
                3,
                2,
                1,
                0,
                2,
            ],
        )

        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 10

        primitive_instance = self.primitive(2)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 0

    def test_nan(self):
        # test if function ignores nan values
        x = pd.Series(
            [
                -76.0,
                41.0,
                -43.0,
                -152.0,
                -89.0,
                28.0,
                49.0,
                298.0,
                -132.0,
                146.0,
                -107.0,
                -26.0,
                26.0,
                -81.0,
                116.0,
                -217.0,
                -102.0,
                144.0,
                120.0,
                -130.0,
            ],
        )
        x = pd.concat([x, pd.Series([np.nan] * 20)])
        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 7

        # test a series with all nan values
        x = pd.Series([np.nan] * 20)

        primitive_instance = self.primitive(1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 0

    def test_negative_n(self):
        with raises(ValueError):
            self.primitive(-1)


class TestCountOutsideRange(PrimitiveTestBase):
    primitive = CountOutsideRange

    def test_integer_range(self):
        # all integers from -100 to 100
        x = pd.Series(np.arange(-100, 101, 1))
        primitive_instance = CountOutsideRange(-100, 100)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 0

        primitive_instance = CountOutsideRange(-50, 50)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 100

        primitive_instance = CountOutsideRange(1, 1)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(x) - 1

    def test_float_range(self):
        x = pd.Series(np.linspace(-3, 3, 10))

        primitive_instance = CountOutsideRange(-3, 3)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 0

        primitive_instance = CountOutsideRange(-0.34, 1.68)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 6

        primitive_instance = CountOutsideRange(-3, -3)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 9

    def test_nan(self):
        x = pd.Series(np.linspace(-3, 3, 10))
        x = pd.concat([x, pd.Series([np.nan] * 20)])
        primitive_instance = CountOutsideRange(-0.34, 1.68)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 6

        primitive_instance = CountOutsideRange(-3, 3, False)
        primitive_func = primitive_instance.get_function()
        assert np.isnan(primitive_func(x))

    def test_inf(self):
        x = pd.Series(np.linspace(-3, 3, 10))
        num_NINF = 20
        x = pd.concat([x, pd.Series([-np.inf] * num_NINF)])
        num_inf = 10
        x = pd.concat([x, pd.Series([np.inf] * num_inf)])

        primitive_instance = CountOutsideRange(-3, 3)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == num_inf + num_NINF

        primitive_instance = CountOutsideRange(-0.34, 1.68)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == 6 + num_inf + num_NINF

        primitive_instance = CountOutsideRange(-np.inf, 3)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == num_inf

        primitive_instance = CountOutsideRange(-3, np.inf)
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == num_NINF


================================================
FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_max_consecutive.py
================================================
import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import (
    MaxConsecutiveFalse,
    MaxConsecutiveNegatives,
    MaxConsecutivePositives,
    MaxConsecutiveTrue,
    MaxConsecutiveZeros,
)


class TestMaxConsecutiveFalse:
    def test_regular(self):
        primitive_instance = MaxConsecutiveFalse()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([False, False, False, True, True, False, True], dtype="bool")
        assert primitive_func(array) == 3

    def test_all_true(self):
        primitive_instance = MaxConsecutiveFalse()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([True, True, True, True], dtype="bool")
        assert primitive_func(array) == 0

    def test_all_false(self):
        primitive_instance = MaxConsecutiveFalse()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([False, False, False], dtype="bool")
        assert primitive_func(array) == 3


class TestMaxConsecutiveTrue:
    def test_regular(self):
        primitive_instance = MaxConsecutiveTrue()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([True, False, True, True, True, False, True], dtype="bool")
        assert primitive_func(array) == 3

    def test_all_true(self):
        primitive_instance = MaxConsecutiveTrue()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([True, True, True, True], dtype="bool")
        assert primitive_func(array) == 4

    def test_all_false(self):
        primitive_instance = MaxConsecutiveTrue()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([False, False, False], dtype="bool")
        assert primitive_func(array) == 0


@pytest.mark.parametrize("dtype", ["float64", "int64"])
class TestMaxConsecutiveNegatives:
    def test_regular(self, dtype):
        if dtype == "int64":
            pytest.skip("test array contains floats, which are not supported by int64")
        primitive_instance = MaxConsecutiveNegatives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1.3, -3.4, -1, -4, 10, -1.7, -4.9], dtype=dtype)
        assert primitive_func(array) == 3

    def test_all_int(self, dtype):
        primitive_instance = MaxConsecutiveNegatives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1, -1, 2, 4, -5], dtype=dtype)
        assert primitive_func(array) == 1

    def test_all_float(self, dtype):
        if dtype == "int64":
            pytest.skip("test array contains floats, which are not supported by int64")
        primitive_instance = MaxConsecutiveNegatives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1.0, -1.0, -2.0, 0.0, 5.0], dtype=dtype)
        assert primitive_func(array) == 2

    def test_with_nan(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutiveNegatives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1, np.nan, -2, -3], dtype=dtype)
        assert primitive_func(array) == 2

    def test_with_nan_skipna(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutiveNegatives(skipna=False)
        primitive_func = primitive_instance.get_function()
        # with skipna=False the NaN is kept and breaks the run, so -1 is not joined to -2, -3
        array = pd.Series([-1, np.nan, -2, -3], dtype=dtype)
        assert primitive_func(array) == 2

    def test_all_nan(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutiveNegatives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)
        assert np.isnan(primitive_func(array))

    def test_all_nan_skipna(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutiveNegatives(skipna=True)
        primitive_func = primitive_instance.get_function()
        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)
        assert np.isnan(primitive_func(array))


@pytest.mark.parametrize("dtype", ["float64", "int64"])
class TestMaxConsecutivePositives:
    def test_regular(self, dtype):
        if dtype == "int64":
            pytest.skip("test array contains floats, which are not supported by int64")
        primitive_instance = MaxConsecutivePositives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1.3, -3.4, 1, 4, 10, -1.7, -4.9], dtype=dtype)
        assert primitive_func(array) == 3

    def test_all_int(self, dtype):
        primitive_instance = MaxConsecutivePositives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1, -1, 2, 4, -5], dtype=dtype)
        assert primitive_func(array) == 2

    def test_all_float(self, dtype):
        if dtype == "int64":
            pytest.skip("test array contains floats, which are not supported by int64")
        primitive_instance = MaxConsecutivePositives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1.0, -1.0, 2.0, 4.0, 5.0], dtype=dtype)
        assert primitive_func(array) == 3

    def test_with_nan(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutivePositives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1, np.nan, 2, -3], dtype=dtype)
        assert primitive_func(array) == 2

    def test_with_nan_skipna(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutivePositives(skipna=False)
        primitive_func = primitive_instance.get_function()
        # with skipna=False the NaN is kept and breaks the run between 1 and 2
        array = pd.Series([1, np.nan, 2, -3], dtype=dtype)
        assert primitive_func(array) == 1

    def test_all_nan(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutivePositives()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)
        assert np.isnan(primitive_func(array))

    def test_all_nan_skipna(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutivePositives(skipna=True)
        primitive_func = primitive_instance.get_function()
        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)
        assert np.isnan(primitive_func(array))


@pytest.mark.parametrize("dtype", ["float64", "int64"])
class TestMaxConsecutiveZeros:
    def test_regular(self, dtype):
        if dtype == "int64":
            pytest.skip("test array contains floats, which are not supported by int64")
        primitive_instance = MaxConsecutiveZeros()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1.3, -3.4, 0, 0, 0.0, 1.7, -4.9], dtype=dtype)
        assert primitive_func(array) == 3

    def test_all_int(self, dtype):
        primitive_instance = MaxConsecutiveZeros()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1, -1, 0, 0, -5], dtype=dtype)
        assert primitive_func(array) == 2

    def test_all_float(self, dtype):
        if dtype == "int64":
            pytest.skip("test array contains floats, which are not supported by int64")
        primitive_instance = MaxConsecutiveZeros()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1.0, 0.0, 0.0, 0.0, -5.3], dtype=dtype)
        assert primitive_func(array) == 3

    def test_with_nan(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutiveZeros()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([0, np.nan, 0, -3], dtype=dtype)
        assert primitive_func(array) == 2

    def test_with_nan_skipna(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutiveZeros(skipna=False)
        primitive_func = primitive_instance.get_function()
        # with skipna=False the NaN is kept and breaks the run between the two zeros
        array = pd.Series([0, np.nan, 0, -3], dtype=dtype)
        assert primitive_func(array) == 1

    def test_all_nan(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutiveZeros()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)
        assert np.isnan(primitive_func(array))

    def test_all_nan_skipna(self, dtype):
        if dtype == "int64":
            pytest.skip("nans not supported in int64")
        primitive_instance = MaxConsecutiveZeros(skipna=True)
        primitive_func = primitive_instance.get_function()
        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)
        assert np.isnan(primitive_func(array))


================================================
FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_num_consecutive.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import NumConsecutiveGreaterMean, NumConsecutiveLessMean


class TestNumConsecutiveGreaterMean:
    primitive = NumConsecutiveGreaterMean

    def test_continuous_range(self):
        x = pd.Series(range(10))
        longest_sequence = [5, 6, 7, 8, 9]
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

    def test_subsequence_in_middle(self):
        x = pd.Series(
            [
                0.6,
                0.18,
                1.11,
                -0.19,
                0.25,
                -1.41,
                0.54,
                0.29,
                -1.59,
                1.67,
                1.19,
                0.44,
                2.39,
                -1.38,
                0.15,
                -1.16,
                1.54,
                -0.34,
                -1.41,
                0.58,
            ],
        )
        longest_sequence = [1.67, 1.19, 0.44, 2.39]
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

    def test_subsequence_at_start(self):
        x = pd.Series(
            [
                1.67,
                1.19,
                0.44,
                2.39,
                -0.19,
                0.6,
                0.18,
                1.11,
                0.25,
                -1.41,
                0.54,
                0.29,
                -1.59,
                -1.38,
                0.15,
                -1.16,
                1.54,
                -0.34,
                -1.41,
                0.58,
            ],
        )
        longest_sequence = [1.67, 1.19, 0.44, 2.39]
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

    def test_subsequence_at_end(self):
        x = pd.Series(
            [
                0.6,
                0.18,
                1.11,
                -0.19,
                0.25,
                -1.41,
                0.54,
                0.29,
                -1.59,
                -1.38,
                0.15,
                -1.16,
                1.54,
                -0.34,
                0.58,
                -1.41,
                1.67,
                1.19,
                0.44,
                2.39,
            ],
        )
        longest_sequence = [1.67, 1.19, 0.44, 2.39]
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

    def test_nan(self):
        x = pd.Series(range(10))
        x = pd.concat([x, pd.Series([np.nan] * 20)])
        longest_sequence = [5, 6, 7, 8, 9]

        # test ignoring NaN values
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

        # test skipna=False
        primitive_instance = self.primitive(skipna=False)
        primitive_func = primitive_instance.get_function()
        assert np.isnan(primitive_func(x))

    def test_inf(self):
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()

        x = pd.Series(range(10))
        x = pd.concat([x, pd.Series([np.inf])])
        assert primitive_func(x) == 0

        x = pd.Series(range(10))
        x = pd.concat([x, pd.Series([-np.inf])])
        assert primitive_func(x) == 10

        x = pd.Series(range(10))
        x = pd.concat([x, pd.Series([-np.inf, np.inf, np.inf])])
        assert np.isnan(primitive_func(x))
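

# Illustrative sketch, not part of the original suite: the expectations in
# test_inf are consistent with the series mean becoming +inf, -inf, or NaN once
# infinities are appended. This only demonstrates pandas' mean on such data; it
# is an assumption that the primitive compares values against that mean, and
# the helper name below is hypothetical.
def _mean_with_infinities_sketch():
    import numpy as np
    import pandas as pd

    finite = pd.Series(range(10), dtype="float64")
    # A single +inf drags the mean to +inf, so no value can exceed it.
    pos = pd.concat([finite, pd.Series([np.inf])]).mean()
    # A single -inf drags the mean to -inf, so every finite value exceeds it.
    neg = pd.concat([finite, pd.Series([-np.inf])]).mean()
    # Mixing -inf and +inf makes the sum, and therefore the mean, undefined (NaN).
    both = pd.concat([finite, pd.Series([-np.inf, np.inf, np.inf])]).mean()
    return pos, neg, both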


class TestNumConsecutiveLessMean:
    primitive = NumConsecutiveLessMean

    def test_continuous_range(self):
        x = pd.Series(range(10))
        longest_sequence = [0, 1, 2, 3, 4]
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

    def test_subsequence_in_middle(self):
        x = pd.Series(
            [
                0.6,
                0.18,
                1.11,
                -0.19,
                0.25,
                -1.41,
                0.54,
                0.29,
                -1.59,
                1.67,
                1.19,
                0.44,
                2.39,
                -1.38,
                0.15,
                -1.16,
                1.54,
                -0.34,
                -1.41,
                0.58,
            ],
        )
        longest_sequence = [-1.38, 0.15, -1.16]
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

    def test_subsequence_at_start(self):
        x = pd.Series(
            [
                -1.38,
                0.15,
                -1.16,
                0.6,
                0.18,
                1.11,
                -0.19,
                0.25,
                -1.41,
                0.54,
                0.29,
                -1.59,
                1.67,
                1.19,
                0.44,
                2.39,
                1.54,
                -0.34,
                -1.41,
                0.58,
            ],
        )
        longest_sequence = [-1.38, 0.15, -1.16]
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

    def test_subsequence_at_end(self):
        x = pd.Series(
            [
                0.6,
                0.18,
                1.11,
                -0.19,
                0.25,
                -1.41,
                0.54,
                0.29,
                -1.59,
                1.67,
                1.19,
                0.44,
                2.39,
                1.54,
                -0.34,
                -1.41,
                0.58,
                -1.38,
                0.15,
                -1.16,
            ],
        )
        longest_sequence = [-1.38, 0.15, -1.16]
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

    def test_nan(self):
        x = pd.Series(range(10))
        x = pd.concat([x, pd.Series([np.nan] * 20)])
        longest_sequence = [0, 1, 2, 3, 4]

        # test ignoring NaN values
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()
        assert primitive_func(x) == len(longest_sequence)

        # test skipna=False
        primitive_instance = self.primitive(skipna=False)
        primitive_func = primitive_instance.get_function()
        assert np.isnan(primitive_func(x))

    def test_inf(self):
        primitive_instance = self.primitive()
        primitive_func = primitive_instance.get_function()

        x = pd.Series(range(10))
        x = pd.concat([x, pd.Series([np.inf])])
        assert primitive_func(x) == 10

        x = pd.Series(range(10))
        x = pd.concat([x, pd.Series([-np.inf])])
        assert primitive_func(x) == 0

        x = pd.Series(range(10))
        x = pd.concat([x, pd.Series([-np.inf, np.inf, np.inf])])
        assert np.isnan(primitive_func(x))


================================================
FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_percent_true.py
================================================
import pandas as pd
from woodwork.logical_types import BooleanNullable

import featuretools as ft


def test_percent_true_default_value_with_dfs():
    es = ft.EntitySet(id="customer_data")

    customers_df = pd.DataFrame(data={"customer_id": [1, 2]})
    transactions_df = pd.DataFrame(
        data={"tx_id": [1], "customer_id": [1], "is_foo": [True]},
    )

    es.add_dataframe(
        dataframe_name="customers_df",
        dataframe=customers_df,
        index="customer_id",
    )
    es.add_dataframe(
        dataframe_name="transactions_df",
        dataframe=transactions_df,
        index="tx_id",
        logical_types={"is_foo": BooleanNullable},
    )

    es = es.add_relationship(
        "customers_df",
        "customer_id",
        "transactions_df",
        "customer_id",
    )

    feature_matrix, _ = ft.dfs(
        entityset=es,
        target_dataframe_name="customers_df",
        agg_primitives=["percent_true"],
    )

    assert pd.isna(feature_matrix["PERCENT_TRUE(transactions_df.is_foo)"][2])


================================================
FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_rolling_primitive.py
================================================
import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import (
    RollingCount,
    RollingMax,
    RollingMean,
    RollingMin,
    RollingOutlierCount,
    RollingSTD,
    RollingTrend,
)
from featuretools.primitives.standard.transform.time_series.utils import (
    apply_rolling_agg_to_series,
)
from featuretools.tests.primitive_tests.utils import get_number_from_offset


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (5, 2),
        (5, 0),
        ("5d", "7d"),
        ("5d", "0d"),
    ],
)
@pytest.mark.parametrize("min_periods", [1, 0, 2, 5])
def test_rolling_max(min_periods, window_length, gap, window_series):
    gap_num = get_number_from_offset(gap)
    window_length_num = get_number_from_offset(window_length)
    # Since we're using a uniform series we can check correctness using numeric parameters
    expected_vals = apply_rolling_agg_to_series(
        window_series,
        lambda x: x.max(),
        window_length_num,
        gap=gap_num,
        min_periods=min_periods,
    )

    primitive_instance = RollingMax(
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )
    primitive_func = primitive_instance.get_function()

    actual_vals = pd.Series(
        primitive_func(window_series.index, pd.Series(window_series.values)),
    )

    # Since min_periods of 0 is the same as min_periods of 1
    num_nans_from_min_periods = min_periods or 1

    assert actual_vals.isna().sum() == gap_num + num_nans_from_min_periods - 1
    pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals)


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (5, 2),
        (5, 0),
        ("5d", "7d"),
        ("5d", "0d"),
    ],
)
@pytest.mark.parametrize("min_periods", [1, 0, 2, 5])
def test_rolling_min(min_periods, window_length, gap, window_series):
    gap_num = get_number_from_offset(gap)
    window_length_num = get_number_from_offset(window_length)

    # Since we're using a uniform series we can check correctness using numeric parameters
    expected_vals = apply_rolling_agg_to_series(
        window_series,
        lambda x: x.min(),
        window_length_num,
        gap=gap_num,
        min_periods=min_periods,
    )

    primitive_instance = RollingMin(
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )
    primitive_func = primitive_instance.get_function()

    actual_vals = pd.Series(
        primitive_func(window_series.index, pd.Series(window_series.values)),
    )

    # Since min_periods of 0 is the same as min_periods of 1
    num_nans_from_min_periods = min_periods or 1

    assert actual_vals.isna().sum() == gap_num + num_nans_from_min_periods - 1
    pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals)


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (5, 2),
        (5, 0),
        ("5d", "7d"),
        ("5d", "0d"),
    ],
)
@pytest.mark.parametrize("min_periods", [1, 0, 2, 5])
def test_rolling_mean(min_periods, window_length, gap, window_series):
    gap_num = get_number_from_offset(gap)
    window_length_num = get_number_from_offset(window_length)

    # Since we're using a uniform series we can check correctness using numeric parameters
    expected_vals = apply_rolling_agg_to_series(
        window_series,
        np.mean,
        window_length_num,
        gap=gap_num,
        min_periods=min_periods,
    )

    primitive_instance = RollingMean(
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )
    primitive_func = primitive_instance.get_function()

    actual_vals = pd.Series(
        primitive_func(window_series.index, pd.Series(window_series.values)),
    )

    # Since min_periods of 0 is the same as min_periods of 1
    num_nans_from_min_periods = min_periods or 1

    assert actual_vals.isna().sum() == gap_num + num_nans_from_min_periods - 1
    pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals)


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (5, 2),
        (5, 0),
        ("5d", "7d"),
        ("5d", "0d"),
    ],
)
@pytest.mark.parametrize("min_periods", [1, 0, 2, 5])
def test_rolling_std(min_periods, window_length, gap, window_series):
    gap_num = get_number_from_offset(gap)
    window_length_num = get_number_from_offset(window_length)

    # Since we're using a uniform series we can check correctness using numeric parameters
    expected_vals = apply_rolling_agg_to_series(
        window_series,
        lambda x: x.std(),
        window_length_num,
        gap=gap_num,
        min_periods=min_periods,
    )

    primitive_instance = RollingSTD(
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )
    primitive_func = primitive_instance.get_function()

    actual_vals = pd.Series(
        primitive_func(window_series.index, pd.Series(window_series.values)),
    )

    if min_periods in [0, 1]:
        # min_periods of 0 is treated the same as 1, but pandas' std returns NaN
        # when a window holds only one value, so there is exactly one NaN beyond
        # those introduced by the gap
        num_nans = gap_num + 1
    else:
        num_nans = gap_num + min_periods - 1

    assert actual_vals.isna().sum() == num_nans
    pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals)


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (5, 2),
        ("6d", "7d"),
    ],
)
def test_rolling_count(window_length, gap, window_series):
    gap_num = get_number_from_offset(gap)
    window_length_num = get_number_from_offset(window_length)

    expected_vals = apply_rolling_agg_to_series(
        window_series,
        lambda x: x.count(),
        window_length_num,
        gap=gap_num,
    )

    primitive_instance = RollingCount(
        window_length=window_length,
        gap=gap,
        min_periods=window_length_num,
    )
    primitive_func = primitive_instance.get_function()

    actual_vals = pd.Series(primitive_func(window_series.index))

    num_nans = gap_num + window_length_num - 1
    assert actual_vals.isna().sum() == num_nans
    # RollingCount will not match the exact roll_series_with_gap call,
    # because it handles the min_periods difference within the primitive
    pd.testing.assert_series_equal(
        pd.Series(expected_vals).iloc[num_nans:],
        actual_vals.iloc[num_nans:],
    )


@pytest.mark.parametrize(
    "min_periods, expected_num_nans",
    [(0, 2), (1, 2), (3, 4), (5, 6)],  # 0 and 1 get treated the same
)
@pytest.mark.parametrize("window_length, gap", [("5d", "2d"), (5, 2)])
def test_rolling_count_primitive_min_periods_nans(
    window_length,
    gap,
    min_periods,
    expected_num_nans,
    window_series,
):
    primitive_instance = RollingCount(
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )
    primitive_func = primitive_instance.get_function()
    vals = pd.Series(primitive_func(window_series.index))

    assert vals.isna().sum() == expected_num_nans


@pytest.mark.parametrize(
    "min_periods, expected_num_nans",
    [(0, 0), (1, 0), (3, 2), (5, 4)],  # 0 and 1 get treated the same
)
@pytest.mark.parametrize("window_length, gap", [("5d", "0d"), (5, 0)])
def test_rolling_count_with_no_gap(
    window_length,
    gap,
    min_periods,
    expected_num_nans,
    window_series,
):
    primitive_instance = RollingCount(
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )
    primitive_func = primitive_instance.get_function()
    vals = pd.Series(primitive_func(window_series.index))

    assert vals.isna().sum() == expected_num_nans


@pytest.mark.parametrize(
    "window_length, gap, expected_vals",
    [
        (3, 0, [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
        (
            4,
            1,
            [np.nan, np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        ),
        (
            "5d",
            "7d",
            [
                np.nan,
                np.nan,
                np.nan,
                np.nan,
                np.nan,
                np.nan,
                np.nan,
                np.nan,
                np.nan,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
                1,
            ],
        ),
        (
            "5d",
            "0d",
            [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        ),
    ],
)
def test_rolling_trend(window_length, gap, expected_vals, window_series):
    primitive_instance = RollingTrend(window_length=window_length, gap=gap)

    actual_vals = primitive_instance(window_series.index, window_series.values)

    pd.testing.assert_series_equal(pd.Series(expected_vals), pd.Series(actual_vals))


def test_rolling_trend_window_length_less_than_three(window_series):
    primitive_instance = RollingTrend(window_length=2)

    vals = primitive_instance(window_series.index, window_series.values)

    for v in vals:
        assert np.isnan(v)


@pytest.mark.parametrize(
    "primitive",
    [
        RollingCount,
        RollingMax,
        RollingMin,
        RollingMean,
        RollingOutlierCount,
    ],
)
def test_rolling_primitives_non_uniform(primitive):
    # When the data isn't uniform, this impacts the number of values in each rolling window
    datetimes = (
        list(pd.date_range(start="2017-01-01", freq="1d", periods=3))
        + list(pd.date_range(start="2017-01-10", freq="2d", periods=4))
        + list(pd.date_range(start="2017-01-22", freq="1d", periods=7))
    )
    no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)

    # Should match RollingCount exactly and have same nan values as other primitives
    expected_series = pd.Series(
        [None, 1, 2] + [None, 1, 1, 1] + [None, 1, 2, 3, 3, 3, 3],
    )

    primitive_instance = primitive(window_length="3d", gap="1d")
    if isinstance(primitive_instance, RollingCount):
        rolled_series = pd.Series(primitive_instance(no_freq_series.index))
        pd.testing.assert_series_equal(rolled_series, expected_series)
    else:
        rolled_series = pd.Series(
            primitive_instance(no_freq_series.index, pd.Series(no_freq_series.values)),
        )
        pd.testing.assert_series_equal(expected_series.isna(), rolled_series.isna())


def test_rolling_std_non_uniform():
    # When the data isn't uniform, this impacts the number of values in each rolling window
    datetimes = (
        list(pd.date_range(start="2017-01-01", freq="1d", periods=3))
        + list(pd.date_range(start="2017-01-10", freq="2d", periods=4))
        + list(pd.date_range(start="2017-01-22", freq="1d", periods=7))
    )
    no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)

    # There will be at least two null values at the beginning of each range's rows, the first for the
    # row skipped by the gap, and the second because pandas' std returns NaN if there's only one row.
    # Because the freq of the middle range was 2 days, its windows never hold more than 1 observation,
    # so every value in that range is null.
    expected_series = pd.Series(
        [None, None, 0.707107]
        + [None, None, None, None]
        + [None, None, 0.707107, 1.0, 1.0, 1.0, 1.0],
    )

    primitive_instance = RollingSTD(window_length="3d", gap="1d")
    rolled_series = pd.Series(
        primitive_instance(no_freq_series.index, pd.Series(no_freq_series.values)),
    )

    pd.testing.assert_series_equal(rolled_series, expected_series)
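

# Illustrative sketch, not part of the original suite: it shows the pandas
# behaviour the comments above rely on. With the default ddof=1, std of a
# single observation is NaN, std of two consecutive integers is about 0.707107,
# and std of three consecutive integers is 1.0. The helper name is hypothetical.
def _std_by_window_size_sketch():
    import pandas as pd

    one = pd.Series([3.0]).std()  # a single value has no sample deviation
    two = pd.Series([3.0, 4.0]).std()  # sqrt(0.5)
    three = pd.Series([3.0, 4.0, 5.0]).std()  # sqrt(2 / 2)
    return one, two, three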


def test_rolling_trend_non_uniform():
    datetimes = (
        list(pd.date_range(start="2017-01-01", freq="1d", periods=3))
        + list(pd.date_range(start="2017-01-10", freq="2d", periods=4))
        + list(pd.date_range(start="2017-01-22", freq="1d", periods=7))
    )
    no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)
    expected_series = pd.Series(
        [None, None, None]
        + [None, None, None, None]
        + [
            None,
            None,
            None,
            1.0,
            1.0,
            1.0,
            1.0,
        ],
    )
    primitive_instance = RollingTrend(window_length="3d", gap="1d")
    rolled_series = pd.Series(
        primitive_instance(no_freq_series.index, pd.Series(no_freq_series.values)),
    )
    pd.testing.assert_series_equal(rolled_series, expected_series)


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (5, 2),
        (5, 0),
        ("5d", "7d"),
        ("5d", "0d"),
    ],
)
@pytest.mark.parametrize(
    "min_periods",
    [1, 0, 2, 5],
)
def test_rolling_outlier_count(
    min_periods,
    window_length,
    gap,
    rolling_outlier_series,
):
    primitive_instance = RollingOutlierCount(
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )

    primitive_func = primitive_instance.get_function()

    actual_vals = pd.Series(
        primitive_func(
            rolling_outlier_series.index,
            pd.Series(rolling_outlier_series.values),
        ),
    )

    expected_vals = apply_rolling_agg_to_series(
        series=rolling_outlier_series,
        agg_func=primitive_instance.get_outliers_count,
        window_length=window_length,
        gap=gap,
        min_periods=min_periods,
    )

    # Since min_periods of 0 is the same as min_periods of 1
    num_nans_from_min_periods = min_periods or 1
    assert (
        actual_vals.isna().sum()
        == get_number_from_offset(gap) + num_nans_from_min_periods - 1
    )
    pd.testing.assert_series_equal(actual_vals, pd.Series(data=expected_vals))


================================================
FILE: featuretools/tests/primitive_tests/aggregation_primitive_tests/test_time_since.py
================================================
from datetime import datetime
from math import isnan

import numpy as np
import pandas as pd

from featuretools.primitives import (
    TimeSinceLastFalse,
    TimeSinceLastMax,
    TimeSinceLastMin,
    TimeSinceLastTrue,
)


class TestTimeSinceLastFalse:
    primitive = TimeSinceLastFalse
    cutoff_time = datetime(2011, 4, 9, 11, 31, 27)
    times = pd.Series(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)],
    )
    booleans = pd.Series([True] * 5 + [False] * 4)

    def test_booleans(self):
        primitive_func = self.primitive().get_function()
        answer = self.cutoff_time - datetime(2011, 4, 9, 10, 31, 27)
        assert (
            primitive_func(
                self.times,
                self.booleans,
                time=self.cutoff_time,
            )
            == answer.total_seconds()
        )

    def test_booleans_reversed(self):
        primitive_func = self.primitive().get_function()
        answer = self.cutoff_time - datetime(2011, 4, 9, 10, 30, 18)
        reversed_booleans = pd.Series(self.booleans.values[::-1])
        assert (
            primitive_func(
                self.times,
                reversed_booleans,
                time=self.cutoff_time,
            )
            == answer.total_seconds()
        )

    def test_no_false(self):
        primitive_func = self.primitive().get_function()
        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])
        booleans = pd.Series([True] * 5)
        assert isnan(primitive_func(times, booleans, time=self.cutoff_time))

    def test_nans(self):
        primitive_func = self.primitive().get_function()
        times = pd.concat([self.times.copy(), pd.Series([np.nan, pd.NaT])])
        booleans = pd.concat(
            [self.booleans.copy(), pd.Series([np.nan], dtype="boolean")],
        )
        times = times.reset_index(drop=True)
        booleans = booleans.reset_index(drop=True)
        answer = self.cutoff_time - datetime(2011, 4, 9, 10, 31, 27)
        assert (
            primitive_func(
                times,
                booleans,
                time=self.cutoff_time,
            )
            == answer.total_seconds()
        )

    def test_empty(self):
        primitive_func = self.primitive().get_function()
        times = pd.Series([], dtype="datetime64[ns]")
        booleans = pd.Series([], dtype="boolean")
        times = times.reset_index(drop=True)
        answer = primitive_func(
            times,
            booleans,
            time=self.cutoff_time,
        )
        assert pd.isna(answer)


class TestTimeSinceLastMax:
    primitive = TimeSinceLastMax
    cutoff_time = datetime(2011, 4, 9, 11, 31, 27)
    times = pd.Series(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)],
    )
    numerics = pd.Series([0, 1, 2, 8, 2, 5, 1, 3, 7])
    actual_time_since = cutoff_time - datetime(2011, 4, 9, 10, 30, 18)
    actual_seconds = actual_time_since.total_seconds()

    def test_primitive_func_1(self):
        primitive_func = self.primitive().get_function()
        assert (
            primitive_func(
                self.times,
                self.numerics,
                time=self.cutoff_time,
            )
            == self.actual_seconds
        )

    def test_no_max(self):
        primitive_func = self.primitive().get_function()
        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])
        numerics = pd.Series([0] * 5)
        actual_time_since = self.cutoff_time - datetime(2011, 4, 9, 10, 30, 0)
        actual_seconds = actual_time_since.total_seconds()
        assert primitive_func(times, numerics, time=self.cutoff_time) == actual_seconds

    def test_nans(self):
        primitive_func = self.primitive().get_function()
        times = pd.concat([self.times.copy(), pd.Series([np.nan, pd.NaT])])
        numerics = pd.concat(
            [self.numerics.copy(), pd.Series([np.nan], dtype="float64")],
        )
        times = times.reset_index(drop=True)
        numerics = numerics.reset_index(drop=True)
        assert (
            primitive_func(
                times,
                numerics,
                time=self.cutoff_time,
            )
            == self.actual_seconds
        )


class TestTimeSinceLastMin:
    primitive = TimeSinceLastMin
    cutoff_time = datetime(2011, 4, 9, 11, 31, 27)
    times = pd.Series(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)],
    )
    numerics = pd.Series([1, 0, 2, 8, 2, 5, 1, 3, 7])
    actual_time_since = cutoff_time - datetime(2011, 4, 9, 10, 30, 6)
    actual_seconds = actual_time_since.total_seconds()

    def test_primitive_func_1(self):
        primitive_func = self.primitive().get_function()
        assert (
            primitive_func(
                self.times,
                self.numerics,
                time=self.cutoff_time,
            )
            == self.actual_seconds
        )

    def test_no_min(self):
        primitive_func = self.primitive().get_function()
        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])
        numerics = pd.Series([0] * 5)
        actual_time_since = self.cutoff_time - datetime(2011, 4, 9, 10, 30, 0)
        actual_seconds = actual_time_since.total_seconds()
        assert primitive_func(times, numerics, time=self.cutoff_time) == actual_seconds

    def test_nans(self):
        primitive_func = self.primitive().get_function()
        times = pd.concat(
            [self.times.copy(), pd.Series([np.nan, pd.NaT], dtype="datetime64[ns]")],
        )
        numerics = pd.concat(
            [self.numerics.copy(), pd.Series([np.nan, np.nan], dtype="float64")],
        )
        times = times.reset_index(drop=True)
        numerics = numerics.reset_index(drop=True)
        assert (
            primitive_func(
                times,
                numerics,
                time=self.cutoff_time,
            )
            == self.actual_seconds
        )


class TestTimeSinceLastTrue:
    primitive = TimeSinceLastTrue
    cutoff_time = datetime(2011, 4, 9, 11, 31, 27)
    times = pd.Series(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)],
    )
    booleans = pd.Series([True] * 5 + [False] * 4)
    actual_time_since = cutoff_time - datetime(2011, 4, 9, 10, 30, 24)
    actual_seconds = actual_time_since.total_seconds()

    def test_primitive_func_1(self):
        primitive_func = self.primitive().get_function()
        assert (
            primitive_func(
                self.times,
                self.booleans,
                time=self.cutoff_time,
            )
            == self.actual_seconds
        )

    def test_no_true(self):
        primitive_func = self.primitive().get_function()
        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])
        booleans = pd.Series([False] * 5)
        assert isnan(primitive_func(times, booleans, time=self.cutoff_time))

    def test_nans(self):
        primitive_func = self.primitive().get_function()
        times = pd.concat(
            [self.times.copy(), pd.Series([np.nan, pd.NaT], dtype="datetime64[ns]")],
        )
        booleans = pd.concat(
            [self.booleans.copy(), pd.Series([np.nan], dtype="boolean")],
        )
        times = times.reset_index(drop=True)
        booleans = booleans.reset_index(drop=True)
        assert (
            primitive_func(
                times,
                booleans,
                time=self.cutoff_time,
            )
            == self.actual_seconds
        )

    def test_no_cutofftime(self):
        primitive_func = self.primitive().get_function()
        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])
        booleans = pd.Series([False] * 5)
        assert isnan(primitive_func(times, booleans))

    def test_empty(self):
        primitive_func = self.primitive().get_function()
        times = pd.Series([], dtype="datetime64[ns]")
        booleans = pd.Series([], dtype="boolean")
        times = times.reset_index(drop=True)
        answer = primitive_func(
            times,
            booleans,
            time=self.cutoff_time,
        )
        assert pd.isna(answer)


================================================
FILE: featuretools/tests/primitive_tests/bad_primitive_files/__init__.py
================================================


================================================
FILE: featuretools/tests/primitive_tests/bad_primitive_files/multiple_primitives.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives import AggregationPrimitive


class CustomMax(AggregationPrimitive):
    name = "custom_max"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})


class CustomSum(AggregationPrimitive):
    name = "custom_sum"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})


================================================
FILE: featuretools/tests/primitive_tests/bad_primitive_files/no_primitives.py
================================================


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/__init__.py
================================================


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_count_string.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import CountString
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestCountString(PrimitiveTestBase):
    primitive = CountString

    def compare(self, primitive_initiated, test_cases, answers):
        """Run the instantiated primitive on test_cases and assert the
        element-wise result equals answers."""
        primitive_func = primitive_initiated.get_function()
        primitive_answers = primitive_func(test_cases)
        return np.testing.assert_array_equal(answers, primitive_answers)

    test_cases = pd.Series(
        [
            # Ignore case
            "Hello other words hello hEllo HELLO",
            # ignore non alphanumeric
            "he\\{ll\t\n\t.--?o othe/r words hello hello h.el./lo",
            # match whole word
            "hellohellohello other hello word go hello here 9hello hello9",
            # all combined
            #   *hello/ counts as hello being its own word
            #   since * and / are non word characters
            #   but 9 is a "word character" so 9hello9
            #   does not count as hello being its own word
            "helloHellohello 9Hello 9hello9 *hello/ test'hel..lo' 'hE.l.lO' \
         hello",
        ],
    )

    def test_non_regex_with_no_other_parameters(self):
        primitive = self.primitive(
            "hello",
            ignore_case=False,
            ignore_non_alphanumeric=False,
            is_regex=False,
            match_whole_words_only=False,
        )
        answers = [1, 2, 7, 5]
        self.compare(primitive, self.test_cases, answers)

    def test_non_regex_ignore_case(self):
        primitive1 = self.primitive(
            "hello",
            ignore_case=True,
            ignore_non_alphanumeric=False,
            is_regex=False,
            match_whole_words_only=False,
        )

        primitive2 = self.primitive(
            "HeLLo",
            ignore_case=True,
            ignore_non_alphanumeric=False,
            is_regex=False,
            match_whole_words_only=False,
        )

        answers = [4, 2, 7, 7]
        self.compare(primitive1, self.test_cases, answers)
        self.compare(primitive2, self.test_cases, answers)

    def test_non_regex_ignore_non_alphanumeric(self):
        primitive = self.primitive(
            "hello",
            ignore_case=False,
            ignore_non_alphanumeric=True,
            is_regex=False,
            match_whole_words_only=False,
        )
        answers = [1, 4, 7, 6]
        self.compare(primitive, self.test_cases, answers)

    def test_non_regex_match_whole_words_only(self):
        primitive = self.primitive(
            "hello",
            ignore_case=False,
            ignore_non_alphanumeric=False,
            is_regex=False,
            match_whole_words_only=True,
        )

        answers = [1, 2, 2, 2]
        self.compare(primitive, self.test_cases, answers)

    def test_non_regex_with_all_other_parameters(self):
        primitive = self.primitive(
            "hello",
            ignore_case=True,
            ignore_non_alphanumeric=True,
            is_regex=False,
            match_whole_words_only=True,
        )

        answers = [4, 4, 2, 3]
        self.compare(primitive, self.test_cases, answers)

    def test_regex_with_no_other_parameters(self):
        primitive = self.primitive(
            "h.l.o",
            ignore_case=False,
            ignore_non_alphanumeric=False,
            is_regex=True,
            match_whole_words_only=False,
        )

        answers = [2, 2, 7, 5]
        self.compare(primitive, self.test_cases, answers)

    def test_regex_with_ignore_case(self):
        primitive = self.primitive(
            "h.l.o",
            ignore_case=True,
            ignore_non_alphanumeric=False,
            is_regex=True,
            match_whole_words_only=False,
        )

        answers = [4, 2, 7, 7]
        self.compare(primitive, self.test_cases, answers)

    def test_regex_with_ignore_non_alphanumeric(self):
        primitive = self.primitive(
            "h.l.o",
            ignore_case=False,
            ignore_non_alphanumeric=True,
            is_regex=True,
            match_whole_words_only=False,
        )

        answers = [2, 4, 7, 6]
        self.compare(primitive, self.test_cases, answers)

    def test_regex_with_match_whole_words_only(self):
        primitive = self.primitive(
            "h.l.o",
            ignore_case=False,
            ignore_non_alphanumeric=False,
            is_regex=True,
            match_whole_words_only=True,
        )

        answers = [2, 2, 2, 2]
        self.compare(primitive, self.test_cases, answers)

    def test_regex_with_all_other_parameters(self):
        primitive = self.primitive(
            "h.l.o",
            ignore_case=True,
            ignore_non_alphanumeric=True,
            is_regex=True,
            match_whole_words_only=True,
        )

        answers = [4, 4, 2, 3]
        self.compare(primitive, self.test_cases, answers)

    def test_overlapping_regex(self):
        primitive = self.primitive(
            "(?=(a.*a))",
            ignore_case=True,
            ignore_non_alphanumeric=True,
            is_regex=True,
            match_whole_words_only=False,
        )
        test_cases = pd.Series(["aaaaaaaaaa", "atesta aa aa a"])
        answers = [9, 6]
        self.compare(primitive, test_cases, answers)

    def test_the(self):
        primitive = self.primitive(
            "the",
            ignore_case=True,
            ignore_non_alphanumeric=False,
            is_regex=False,
            match_whole_words_only=False,
        )
        test_cases = pd.Series(["The fox jumped over the cat", "The there then"])

        answers = [2, 3]
        self.compare(primitive, test_cases, answers)

    def test_nan(self):
        primitive = self.primitive(
            "the",
            ignore_case=True,
            ignore_non_alphanumeric=False,
            is_regex=False,
            match_whole_words_only=False,
        )
        test_cases = pd.Series(
            [np.nan, None, pd.NA, "The fox jumped over the cat", "The there then"],
        )
        answers = [np.nan, np.nan, np.nan, 2, 3]
        self.compare(primitive, test_cases, answers)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive(
            "the",
            ignore_case=True,
            ignore_non_alphanumeric=False,
            is_regex=False,
            match_whole_words_only=False,
        )
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)

    def test_with_featuretools_nan(self, es):
        log_df = es["log"]
        comments = log_df["comments"]
        comments[1] = pd.NA
        comments[2] = np.nan
        comments[3] = None
        log_df["comments"] = comments
        es.replace_dataframe(dataframe_name="log", df=log_df)

        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive(
            "the",
            ignore_case=True,
            ignore_non_alphanumeric=False,
            is_regex=False,
            match_whole_words_only=False,
        )
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)
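

# --- Illustrative sketch (not part of the original featuretools tests) ---
# The counts expected by test_overlapping_regex and
# test_non_regex_match_whole_words_only above can be reproduced with the
# standard library alone. This is a minimal sketch that assumes plain `re`
# semantics; it is NOT CountString's implementation, and the helper names
# below are hypothetical.
import re


def _sketch_count_overlapping(pattern, text):
    # A zero-width lookahead consumes no characters, so the regex engine can
    # report a match at every start position, including overlapping ones.
    return len(re.findall(pattern, text))


def _sketch_count_whole_words(word, text):
    # \b matches only between a word character (letter, digit, underscore)
    # and a non-word character, so "9hello" and "hello9" are not counted.
    return len(re.findall(rf"\b{re.escape(word)}\b", text))


def test_sketch_counts_agree_with_expected_answers():
    # Same inputs and expected counts as test_overlapping_regex.
    assert _sketch_count_overlapping("(?=(a.*a))", "aaaaaaaaaa") == 9
    assert _sketch_count_overlapping("(?=(a.*a))", "atesta aa aa a") == 6
    # Third entry of test_cases with match_whole_words_only=True.
    whole_word_case = "hellohellohello other hello word go hello here 9hello hello9"
    assert _sketch_count_whole_words("hello", whole_word_case) == 2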


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_mean_characters_per_word.py
================================================
import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import MeanCharactersPerWord
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestMeanCharactersPerWord(PrimitiveTestBase):
    primitive = MeanCharactersPerWord

    def test_sentences(self):
        x = pd.Series(
            [
                "This is a test file",
                "This is second line",
                "third line $1,000",
                "and subsequent lines",
                "and more",
            ],
        )
        primitive_func = self.primitive().get_function()
        answers = pd.Series([3.0, 4.0, 5.0, 6.0, 3.5])
        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)

    def test_punctuation(self):
        x = pd.Series(
            [
                "This: is a test file",
                "This, is second line?",
                "third/line $1,000;",
                "and--subsequen't lines...",
                "*and, more..",
            ],
        )
        primitive_func = self.primitive().get_function()
        answers = pd.Series([3.0, 4.0, 8.0, 10.5, 4.0])
        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)

    def test_multiline(self):
        x = pd.Series(
            [
                "This is a test file",
                "This is second line\nthird line $1000;\nand subsequent lines",
                "and more",
            ],
        )
        primitive_func = self.primitive().get_function()
        answers = pd.Series([3.0, 4.8, 3.5])
        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)

    @pytest.mark.parametrize(
        "na_value",
        [None, np.nan, pd.NA],
    )
    def test_nans(self, na_value):
        x = pd.Series([na_value, "", "third line"])
        primitive_func = self.primitive().get_function()
        answers = pd.Series([np.nan, 0, 4.5])
        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)

    @pytest.mark.parametrize(
        "na_value",
        [None, np.nan, pd.NA],
    )
    def test_all_nans(self, na_value):
        x = pd.Series([na_value, na_value, na_value])
        primitive_func = self.primitive().get_function()
        answers = pd.Series([np.nan, np.nan, np.nan])
        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_median_word_length.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import MedianWordLength
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestMedianWordLength(PrimitiveTestBase):
    primitive = MedianWordLength

    def test_delimiter_override(self):
        x = pd.Series(
            ["This is a test file.", "This,is,second,line?", "and;subsequent;lines..."],
        )

        expected = pd.Series([4.0, 4.5, 8.0])
        actual = self.primitive("[ ,;]").get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_multiline(self):
        x = pd.Series(
            [
                "This is a test file.",
                "This is second line\nthird line $1000;\nand subsequent lines",
            ],
        )

        expected = pd.Series([4.0, 4.5])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_null(self):
        x = pd.Series([np.nan, pd.NA, None, "This is a test file."])

        actual = self.primitive().get_function()(x)
        expected = pd.Series([np.nan, np.nan, np.nan, 4.0])
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_natural_language_primitives_terminate.py
================================================
import pandas as pd
import pytest

from featuretools.primitives.utils import _get_natural_language_primitives

# Per-test time limit in seconds, passed to pytest.mark.timeout below.
TIMEOUT_THRESHOLD = 20


class TestNaturalLanguagePrimitivesTerminate:
    # need to sort primitives to avoid pytest collection error
    primitives = sorted(_get_natural_language_primitives().items())

    @pytest.mark.timeout(TIMEOUT_THRESHOLD)
    @pytest.mark.parametrize("primitive", [prim for _, prim in primitives])
    def test_natlang_primitive_does_not_timeout(
        self,
        strings_that_have_triggered_errors_before,
        primitive,
    ):
        for text in strings_that_have_triggered_errors_before:
            primitive().get_function()(pd.Series(text))


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_num_characters.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import NumCharacters
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestNumCharacters(PrimitiveTestBase):
    primitive = NumCharacters

    def test_general(self):
        x = pd.Series(
            [
                "test test test test",
                "test TEST test TEST,test test test",
                "and subsequent lines...",
            ],
        )
        expected = pd.Series([19, 34, 23])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_special_characters_and_whitespace(self):
        x = pd.Series(["50% 50 50% \t\t\t\n\n", "$5,3040 a test* test"])
        expected = pd.Series([16, 20])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_unicode_input(self):
        x = pd.Series(
            [
                "Ángel Angel Ángel ángel",
            ],
        )
        expected = pd.Series([23])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_null(self):
        x = pd.Series([np.nan, pd.NA, None, "This is a test file."])
        actual = self.primitive().get_function()(x)
        expected = pd.Series([pd.NA, pd.NA, pd.NA, 20])
        pd.testing.assert_series_equal(
            actual,
            expected,
            check_names=False,
            check_dtype=False,
        )

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_num_unique_separators.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import NumUniqueSeparators
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestNumUniqueSeparators(PrimitiveTestBase):
    primitive = NumUniqueSeparators

    def test_punctuation(self):
        x = pd.Series(
            [
                "This: is a test file",
                "This, is second line?",
                "third/line $1,000;",
                "and--subsequen't lines...",
                "*and, more..",
            ],
        )
        primitive_func = self.primitive().get_function()
        answers = pd.Series([1, 3, 3, 2, 3])
        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)

    def test_other_delimiters(self):
        x = pd.Series(["@#$%^&*()<>/[]\\`~-_=+"])
        primitive_func = self.primitive().get_function()
        answers = pd.Series([0])
        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)

    def test_multiline(self):
        x = pd.Series(
            [
                "This is a test file",
                "This is second line\nthird line $1000;\nand subsequent lines",
                "and more!",
            ],
        )
        primitive_func = self.primitive().get_function()
        answers = pd.Series([1, 3, 2])
        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)

    def test_nans(self):
        x = pd.Series([np.nan, "", "third line."])
        primitive_func = self.primitive().get_function()
        answers = pd.Series([pd.NA, 0, 2])
        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_num_words.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import NumWords
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestNumWords(PrimitiveTestBase):
    primitive = NumWords

    def test_general(self):
        x = pd.Series(
            [
                "test test test test",
                "test TEST test TEST,test test test",
                "and subsequent lines...",
            ],
        )
        expected = pd.Series([4, 6, 3])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_special_characters_and_whitespace(self):
        x = pd.Series(["50% 50 50% \t\t\t\n\n", "$5,3040 a test* test"])
        expected = pd.Series([3, 4])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_unicode_input(self):
        x = pd.Series(
            [
                "Ángel Angel Ángel ángel",
            ],
        )
        expected = pd.Series([4])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_contractions(self):
        x = pd.Series(
            [
                "can't won't don't can't aren't won't don't they'd there's",
            ],
        )
        expected = pd.Series([9])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_multiple_spaces(self):
        x = pd.Series(
            [
                "    word  word            word word     .",
                "This is                      \nthird line \nthird line",
            ],
        )
        expected = pd.Series([4, 6])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_null(self):
        x = pd.Series([np.nan, pd.NA, None, "This is a test file."])
        actual = self.primitive().get_function()(x)
        expected = pd.Series([pd.NA, pd.NA, pd.NA, 5])
        pd.testing.assert_series_equal(
            actual,
            expected,
            check_names=False,
            check_dtype=False,
        )

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_common_words.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import NumberOfCommonWords
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestNumberOfCommonWords(PrimitiveTestBase):
    primitive = NumberOfCommonWords
    test_word_bank = {"and", "a", "is"}

    def test_delimiter_override(self):
        x = pd.Series(
            [
                "This is a test file.",
                "This,is,second,line, and?",
                "and;subsequent;lines...",
            ],
        )

        expected = pd.Series([2, 2, 1])
        actual = self.primitive(
            word_set=self.test_word_bank,
            delimiters_regex="[ ,;]",
        ).get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_multiline(self):
        x = pd.Series(
            [
                "This is a test file.",
                "This is second line\nthird line $1000;\nand subsequent lines",
            ],
        )

        expected = pd.Series([2, 2])
        actual = self.primitive(self.test_word_bank).get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_null(self):
        x = pd.Series([np.nan, pd.NA, None, "This is a test file."])

        actual = self.primitive(self.test_word_bank).get_function()(x)
        expected = pd.Series([pd.NA, pd.NA, pd.NA, 2])
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_case_insensitive(self):
        x = pd.Series(["Is", "a", "AND"])

        actual = self.primitive(self.test_word_bank).get_function()(x)
        expected = pd.Series([1, 1, 1])
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_hashtags.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import NumberOfHashtags
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestNumberOfHashtags(PrimitiveTestBase):
    primitive = NumberOfHashtags

    def test_regular_input(self):
        x = pd.Series(
            [
                "#hello #hi #hello",
                "#regular#expression#0or1#yes",
                "andorandorand #32309",
            ],
        )
        expected = [3.0, 0.0, 0.0]
        actual = self.primitive().get_function()(x)
        np.testing.assert_array_equal(actual, expected)

    def test_unicode_input(self):
        x = pd.Series(
            [
                "#Ángel #Æ #ĘÁÊÚ",
                "#############Āndandandandand###",
                "andorandorand #32309",
            ],
        )
        expected = [3.0, 0.0, 0.0]
        actual = self.primitive().get_function()(x)
        np.testing.assert_array_equal(actual, expected)

    def test_multiline(self):
        x = pd.Series(
            [
                "#\n\t\n",
                "#hashtag\n#hashtag2\n#\n\n",
            ],
        )

        expected = [0.0, 2.0]
        actual = self.primitive().get_function()(x)
        np.testing.assert_array_equal(actual, expected)

    def test_null(self):
        x = pd.Series([np.nan, pd.NA, None, "#test"])

        actual = self.primitive().get_function()(x)
        expected = [np.nan, np.nan, np.nan, 1.0]
        np.testing.assert_array_equal(actual, expected)

    def test_alphanumeric_and_special(self):
        x = pd.Series(["#1or0", "#12", "#??!>@?@#>"])

        actual = self.primitive().get_function()(x)
        expected = [1.0, 0.0, 0.0]
        np.testing.assert_array_equal(actual, expected)

    def test_underscore(self):
        x = pd.Series(["#no", "#__yes", "#??!>@?@#>"])

        actual = self.primitive().get_function()(x)
        expected = [1.0, 1.0, 0.0]
        np.testing.assert_array_equal(actual, expected)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_mentions.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import NumberOfMentions
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestNumberOfMentions(PrimitiveTestBase):
    primitive = NumberOfMentions

    def test_regular_input(self):
        x = pd.Series(
            [
                "@hello @hi @hello",
                "@and@",
                "andorandorand",
            ],
        )
        expected = [3.0, 0.0, 0.0]
        actual = self.primitive().get_function()(x)
        np.testing.assert_array_equal(actual, expected)

    def test_unicode_input(self):
        x = pd.Series(
            [
                "@Ángel @Æ @ĘÁÊÚ",
                "@@@@Āndandandandand@",
                "andorandorand @32309",
                "example@gmail.com",
                "@example-20329",
            ],
        )
        expected = [3.0, 0.0, 1.0, 0.0, 1.0]
        actual = self.primitive().get_function()(x)
        np.testing.assert_array_equal(actual, expected)

    def test_multiline(self):
        x = pd.Series(
            [
                "@\n\t\n",
                "@mention\n @mention2\n@\n\n",
            ],
        )

        expected = [0.0, 2.0]
        actual = self.primitive().get_function()(x)
        np.testing.assert_array_equal(actual, expected)

    def test_null(self):
        x = pd.Series([np.nan, pd.NA, None, "@test"])

        actual = self.primitive().get_function()(x)
        expected = [np.nan, np.nan, np.nan, 1.0]
        np.testing.assert_array_equal(actual, expected)

    def test_alphanumeric_and_special(self):
        x = pd.Series(["@1or0", "@12", "#??!>@?@#>"])

        actual = self.primitive().get_function()(x)
        expected = [1.0, 1.0, 0.0]
        np.testing.assert_array_equal(actual, expected)

    def test_underscore(self):
        x = pd.Series(["@user1", "@__yes", "#??!>@?@#>"])

        actual = self.primitive().get_function()(x)
        expected = [1.0, 1.0, 0.0]
        np.testing.assert_array_equal(actual, expected)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_unique_words.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import NumberOfUniqueWords
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestNumberOfUniqueWords(PrimitiveTestBase):
    primitive = NumberOfUniqueWords

    def test_general(self):
        x = pd.Series(
            [
                "test test test test",
                "test TEST test TEST",
                "and subsequent lines...",
            ],
        )

        expected = pd.Series([1, 2, 3])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_special_characters_and_whitespace(self):
        x = pd.Series(["50% 50 50% \t\t\t\n\n", "a test* test"])

        expected = pd.Series([1, 2])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_unicode_input(self):
        x = pd.Series(
            [
                "Ángel Angel Ángel ángel",
            ],
        )

        expected = pd.Series([3])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_contractions(self):
        x = pd.Series(
            [
                "can't won't don't can't aren't won't don't they'd there's",
            ],
        )

        expected = pd.Series([6])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_multiline(self):
        x = pd.Series(
            [
                "word word word word.",
                "This is \nthird line \nthird line",
            ],
        )

        expected = pd.Series([1, 4])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_null(self):
        x = pd.Series([np.nan, pd.NA, None, "This is a test file."])

        actual = self.primitive().get_function()(x)
        expected = pd.Series([pd.NA, pd.NA, pd.NA, 5])
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_case_insensitive(self):
        x = pd.Series(["WORD word WORd WORd WOrD word"])

        actual = self.primitive(case_insensitive=True).get_function()(x)
        expected = pd.Series([1])
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_words_in_quotes.py
================================================
import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import NumberOfWordsInQuotes
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestNumberOfWordsInQuotes(PrimitiveTestBase):
    primitive = NumberOfWordsInQuotes

    def test_regular_double_quotes_input(self):
        x = pd.Series(
            [
                'Yes "    "',
                '"Hello this is a test"',
                '"Yes" "   "',
                "",
                '"Python, java prolog"',
                '"Python, java prolog" three words here "binary search algorithm"',
                '"Diffie-Hellman key exchange"',
                '"user@email.com"',
                '"https://alteryx.com"',
                '"100,000"',
                '"This Borderlands game here"" is the perfect conclusion to the ""Borderlands 3"" line, which focuses on the fans ""favorite character and gives the players the opportunity to close for a long time some very important questions about\'s character and the memorable scenery with which the players interact.',
            ],
        )
        expected = pd.Series([0, 5, 1, 0, 3, 6, 3, 1, 1, 1, 6], dtype="Int64")
        actual = self.primitive("double").get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_captures_regular_single_quotes(self):
        x = pd.Series(
            [
                "'Hello this is a test'",
                "'Python, Java Prolog'",
                "'Python, Java Prolog' three words here 'three words here'",
                "'Diffie-Hellman key exchange'",
                "'user@email.com'",
                "'https://alteryx.com'",
                "'there's where's here's' word 'word'",
                "'100,000'",
            ],
        )
        expected = pd.Series([5, 3, 6, 3, 1, 1, 4, 1], dtype="Int64")
        actual = self.primitive("single").get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_captures_both_single_and_double_quotes(self):
        x = pd.Series(
            [
                "'test test test test' three words here \"test test test!\"",
            ],
        )
        expected = pd.Series([7], dtype="Int64")
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_unicode_input(self):
        x = pd.Series(
            [
                '"Ángel"',
                '"Ángel" word word',
            ],
        )
        expected = pd.Series([1, 1], dtype="Int64")
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_multiline(self):
        x = pd.Series(
            [
                "'Yes\n, this is me'",
            ],
        )
        expected = pd.Series([4], dtype="Int64")
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_raises_error_invalid_args(self):
        error_msg = (
            "NULL is not a valid quote_type. Specify 'both', 'single', or 'double'"
        )
        with pytest.raises(
            ValueError,
            match=error_msg,
        ):
            self.primitive(quote_type="NULL")

    def test_null(self):
        x = pd.Series([np.nan, pd.NA, None, '"test"'])
        actual = self.primitive().get_function()(x)
        expected = pd.Series([pd.NA, pd.NA, pd.NA, 1.0], dtype="Int64")
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_punctuation_count.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import PunctuationCount
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestPunctuationCount(PrimitiveTestBase):
    primitive = PunctuationCount

    def test_punctuation(self):
        x = pd.Series(
            [
                "This is a test file.",
                "This, is second line?",
                "third/line $1,000;",
                "and--subsequen't lines...",
                "*and, more..",
            ],
        )
        primitive_func = self.primitive().get_function()
        answers = [1.0, 2.0, 4.0, 6.0, 4.0]
        np.testing.assert_array_equal(primitive_func(x), answers)

    def test_multiline(self):
        x = pd.Series(
            [
                "This is a test file.",
                "This is second line\nthird line $1000;\nand subsequent lines",
            ],
        )
        primitive_func = self.primitive().get_function()
        answers = [1.0, 2.0]
        np.testing.assert_array_equal(primitive_func(x), answers)

    def test_nan(self):
        x = pd.Series([np.nan, "", "This is a test file."])
        primitive_func = self.primitive().get_function()
        answers = [np.nan, 0.0, 1.0]
        np.testing.assert_array_equal(primitive_func(x), answers)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_title_word_count.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import TitleWordCount
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestTitleWordCount(PrimitiveTestBase):
    primitive = TitleWordCount

    def test_strings(self):
        x = pd.Series(
            [
                "My favorite movie is Jaws.",
                "this is a string",
                "AAA",
                "I bought a Yo-Yo",
            ],
        )
        primitive_func = self.primitive().get_function()
        answers = [2.0, 0.0, 1.0, 2.0]
        np.testing.assert_array_equal(answers, primitive_func(x))

    def test_nan(self):
        x = pd.Series([np.nan, "", "My favorite movie is Jaws."])
        primitive_func = self.primitive().get_function()
        answers = [np.nan, 0.0, 2.0]
        np.testing.assert_array_equal(answers, primitive_func(x))

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_total_word_length.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import TotalWordLength
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestTotalWordLength(PrimitiveTestBase):
    primitive = TotalWordLength

    def test_delimiter_override(self):
        x = pd.Series(
            ["This is a test file.", "This,is,second,line?", "and;subsequent;lines..."],
        )

        expected = pd.Series([16, 17, 21])
        actual = self.primitive("[ ,;]").get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_multiline(self):
        x = pd.Series(
            [
                "This is a test file.",
                "This is second line\nthird line $1000;\nand subsequent lines",
            ],
        )

        expected = pd.Series([15, 47])
        actual = self.primitive().get_function()(x)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_null(self):
        x = pd.Series([np.nan, pd.NA, None, "This is a test file."])

        expected = pd.Series([np.nan, np.nan, np.nan, 15])
        actual = self.primitive().get_function()(x).astype(float)
        pd.testing.assert_series_equal(actual, expected, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_upper_case_count.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import UpperCaseCount
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestUpperCaseCount(PrimitiveTestBase):
    primitive = UpperCaseCount

    def test_strings(self):
        x = pd.Series(
            ["This IS a STRING.", "Testing AaA", "Testing AAA-BBB", "testing aaa"],
        )
        primitive_func = self.primitive().get_function()
        answers = [9.0, 3.0, 7.0, 0.0]
        np.testing.assert_array_equal(primitive_func(x), answers)

    def test_nan(self):
        x = pd.Series([np.nan, "", "This IS a STRING."])
        primitive_func = self.primitive().get_function()
        answers = [np.nan, 0.0, 9.0]
        np.testing.assert_array_equal(primitive_func(x), answers)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_upper_case_word_count.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import UpperCaseWordCount


class TestUpperCaseWordCount:
    primitive = UpperCaseWordCount

    def test_strings(self):
        x = pd.Series(
            [
                "This IS a STRING.",
                "Testing AAA",
                "Testing AAA BBB",
                "Testing TEsTIng AA3 AA_33 HELLO",
                "AAA $@()#$@@#$",
            ],
            dtype="string",
        )
        primitive_func = self.primitive().get_function()
        answers = pd.Series([2, 1, 2, 3, 1], dtype="Int64")
        pd.testing.assert_series_equal(
            primitive_func(x).astype("Int64"),
            answers,
            check_names=False,
        )

    def test_nan(self):
        x = pd.Series(
            [
                np.nan,
                "",
                "This IS a STRING.",
            ],
            dtype="string",
        )
        primitive_func = self.primitive().get_function()
        answers = pd.Series([pd.NA, 0, 2], dtype="Int64")
        pd.testing.assert_series_equal(
            primitive_func(x).astype("Int64"),
            answers,
            check_names=False,
        )


================================================
FILE: featuretools/tests/primitive_tests/natural_language_primitives_tests/test_whitespace_count.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import WhitespaceCount
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestWhitespaceCount(PrimitiveTestBase):
    primitive = WhitespaceCount

    def compare(self, primitive_initiated, test_cases, answers):
        primitive_func = primitive_initiated.get_function()
        primitive_answers = primitive_func(test_cases)
        return np.testing.assert_array_equal(answers, primitive_answers)

    def test_strings(self):
        x = pd.Series(
            ["", "hi im ethan!", "consecutive.    spaces.", " spaces-on-ends "],
        )
        answers = [0, 2, 4, 2]
        self.compare(self.primitive(), x, answers)

    def test_nan(self):
        x = pd.Series([np.nan, None, pd.NA, "", "This IS a STRING."])
        answers = [np.nan, np.nan, np.nan, 0, 3]
        self.compare(self.primitive(), x, answers)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/primitives_to_install/__init__.py
================================================


================================================
FILE: featuretools/tests/primitive_tests/primitives_to_install/custom_max.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import AggregationPrimitive


class CustomMax(AggregationPrimitive):
    name = "custom_max"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})


================================================
FILE: featuretools/tests/primitive_tests/primitives_to_install/custom_mean.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import AggregationPrimitive


class CustomMean(AggregationPrimitive):
    name = "custom_mean"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})


================================================
FILE: featuretools/tests/primitive_tests/primitives_to_install/custom_sum.py
================================================
from woodwork.column_schema import ColumnSchema

from featuretools.primitives.base import AggregationPrimitive


class CustomSum(AggregationPrimitive):
    name = "custom_sum"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})


================================================
FILE: featuretools/tests/primitive_tests/test_absolute_diff.py
================================================
import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import AbsoluteDiff


class TestAbsoluteDiff:
    def test_nan(self):
        data = pd.Series([np.nan, 5, 10, 20, np.nan, 10, np.nan])
        answer = pd.Series([np.nan, np.nan, 5, 10, 0, 10, 0])
        primitive_func = AbsoluteDiff().get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_regular(self):
        data = pd.Series([2, 5, 15, 3, 9, 4.5])
        answer = pd.Series([np.nan, 3, 10, 12, 6, 4.5])
        primitive_func = AbsoluteDiff().get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_method(self):
        data = pd.Series([2, np.nan, 15, 3, np.nan, 4.5])
        answer = pd.Series([np.nan, 13, 0, 12, 1.5, 0])
        primitive_func = AbsoluteDiff(method="backfill").get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_limit(self):
        data = pd.Series([2, np.nan, np.nan, np.nan, 3.0, 4.5])
        answer = pd.Series([np.nan, 0, 0, np.nan, np.nan, 1.5])
        primitive_func = AbsoluteDiff(limit=2).get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_zero(self):
        data = pd.Series([2, 0, 0, 5, 0, -4])
        answer = pd.Series([np.nan, 2, 0, 5, 5, 4])
        primitive_func = AbsoluteDiff().get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_empty(self):
        data = pd.Series([], dtype="float64")
        answer = pd.Series([], dtype="float64")
        primitive_func = AbsoluteDiff().get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_inf(self):
        data = pd.Series([0, np.inf, 0, 5, -np.inf, np.inf, -np.inf])
        answer = pd.Series([np.nan, np.inf, np.inf, 5, np.inf, np.inf, np.inf])
        primitive_func = AbsoluteDiff().get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_raises(self):
        with pytest.raises(ValueError):
            AbsoluteDiff(method="invalid")


================================================
FILE: featuretools/tests/primitive_tests/test_agg_feats.py
================================================
from datetime import datetime
from inspect import isclass
from math import isnan

import numpy as np
import pandas as pd
import pytest
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools import (
    AggregationFeature,
    Feature,
    IdentityFeature,
    Timedelta,
    calculate_feature_matrix,
    dfs,
    primitives,
)
from featuretools.entityset.relationship import RelationshipPath
from featuretools.feature_base.cache import feature_cache
from featuretools.primitives import (
    Count,
    Max,
    Mean,
    Median,
    NMostCommon,
    NumTrue,
    NumUnique,
    Sum,
    TimeSinceFirst,
    TimeSinceLast,
    get_aggregation_primitives,
)
from featuretools.primitives.base import AggregationPrimitive
from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis, match
from featuretools.tests.testing_utils import backward_path, feature_with_name


@pytest.fixture(autouse=True)
def reset_dfs_cache():
    feature_cache.enabled = False
    feature_cache.clear_all()


def test_get_depth(es):
    log_id_feat = IdentityFeature(es["log"].ww["id"])
    customer_id_feat = IdentityFeature(es["customers"].ww["id"])
    count_logs = Feature(log_id_feat, parent_dataframe_name="sessions", primitive=Count)
    sum_count_logs = Feature(
        count_logs,
        parent_dataframe_name="customers",
        primitive=Sum,
    )
    num_logs_greater_than_5 = sum_count_logs > 5
    count_customers = Feature(
        customer_id_feat,
        parent_dataframe_name="régions",
        where=num_logs_greater_than_5,
        primitive=Count,
    )
    num_customers_region = Feature(count_customers, dataframe_name="customers")

    depth = num_customers_region.get_depth()
    assert depth == 5


def test_makes_count(es):
    dfs = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Count],
        trans_primitives=[],
    )

    features = dfs.build_features()
    assert feature_with_name(features, "device_type")
    assert feature_with_name(features, "customer_id")
    assert feature_with_name(features, "customers.région_id")
    assert feature_with_name(features, "customers.age")
    assert feature_with_name(features, "COUNT(log)")
    assert feature_with_name(features, "customers.COUNT(sessions)")
    assert feature_with_name(features, "customers.régions.language")
    assert feature_with_name(features, "customers.COUNT(log)")


def test_count_null(es):
    class Count(AggregationPrimitive):
        name = "count"
        input_types = [[ColumnSchema(semantic_tags={"foreign_key"})], [ColumnSchema()]]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        stack_on_self = False

        def __init__(self, count_null=True):
            self.count_null = count_null

        def get_function(self):
            def count_func(values):
                if self.count_null:
                    values = values.fillna(0)

                return values.count()

            return count_func

        def generate_name(
            self,
            base_feature_names,
            relationship_path_name,
            parent_dataframe_name,
            where_str,
            use_prev_str,
        ):
            return "COUNT(%s%s%s)" % (relationship_path_name, where_str, use_prev_str)

    count_null = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        primitive=Count(count_null=True),
    )
    feature_matrix = calculate_feature_matrix([count_null], entityset=es)
    values = [5, 4, 1, 2, 3, 2]
    assert (values == feature_matrix[count_null.get_name()]).all()


def test_check_input_types(es):
    count = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    mean = Feature(count, parent_dataframe_name="régions", primitive=Mean)
    assert mean._check_input_types()

    boolean = count > 3
    mean = Feature(
        count,
        parent_dataframe_name="régions",
        where=boolean,
        primitive=Mean,
    )
    assert mean._check_input_types()


def test_mean_nan(es):
    array = pd.Series([5, 5, 5, 5, 5])
    mean_func_nans_default = Mean().get_function()
    mean_func_nans_false = Mean(skipna=False).get_function()
    mean_func_nans_true = Mean(skipna=True).get_function()
    assert mean_func_nans_default(array) == 5
    assert mean_func_nans_false(array) == 5
    assert mean_func_nans_true(array) == 5
    array = pd.Series([5, np.nan, np.nan, np.nan, np.nan, 10])
    assert mean_func_nans_default(array) == 7.5
    assert isnan(mean_func_nans_false(array))
    assert mean_func_nans_true(array) == 7.5
    array_nans = pd.Series([np.nan, np.nan, np.nan, np.nan])
    assert isnan(mean_func_nans_default(array_nans))
    assert isnan(mean_func_nans_false(array_nans))
    assert isnan(mean_func_nans_true(array_nans))

    # test naming
    default_feat = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Mean,
    )
    assert default_feat.get_name() == "MEAN(log.value)"
    ignore_nan_feat = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Mean(skipna=True),
    )
    assert ignore_nan_feat.get_name() == "MEAN(log.value)"
    include_nan_feat = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Mean(skipna=False),
    )
    assert include_nan_feat.get_name() == "MEAN(log.value, skipna=False)"


def test_init_and_name(es):
    log = es["log"]

    # Add a BooleanNullable column so primitives with that input type get tested
    boolean_nullable = log.ww["purchased"]
    boolean_nullable = boolean_nullable.ww.set_logical_type("BooleanNullable")
    log.ww["boolean_nullable"] = boolean_nullable

    features = [Feature(es["log"].ww[col]) for col in log.columns]

    # check that all aggregation primitives have a name
    for attribute_string in dir(primitives):
        attr = getattr(primitives, attribute_string)
        if isclass(attr):
            if issubclass(attr, AggregationPrimitive) and attr != AggregationPrimitive:
                assert getattr(attr, "name") is not None

    agg_primitives = get_aggregation_primitives().values()

    for agg_prim in agg_primitives:
        input_types = agg_prim.input_types
        if not isinstance(input_types[0], list):
            input_types = [input_types]

        # test each allowed set of input_types for this primitive
        for it in input_types:
            # use the input_types matching function from DFS
            matching_types = match(it, features)
            if len(matching_types) == 0:
                raise Exception("Agg Primitive %s not tested" % agg_prim.name)
            for t in matching_types:
                instance = Feature(
                    t,
                    parent_dataframe_name="sessions",
                    primitive=agg_prim,
                )

                # try to get name and calculate
                instance.get_name()
                calculate_feature_matrix([instance], entityset=es)


def test_invalid_init_args(diamond_es):
    error_text = "parent_dataframe must match first relationship in path"
    with pytest.raises(AssertionError, match=error_text):
        path = backward_path(diamond_es, ["stores", "transactions"])
        AggregationFeature(
            IdentityFeature(diamond_es["transactions"].ww["amount"]),
            "customers",
            Mean,
            relationship_path=path,
        )

    error_text = (
        "Base feature must be defined on the dataframe at the end of relationship_path"
    )
    with pytest.raises(AssertionError, match=error_text):
        path = backward_path(diamond_es, ["regions", "stores"])
        AggregationFeature(
            IdentityFeature(diamond_es["transactions"].ww["amount"]),
            "regions",
            Mean,
            relationship_path=path,
        )

    error_text = "All relationships in path must be backward"
    with pytest.raises(AssertionError, match=error_text):
        backward = backward_path(diamond_es, ["customers", "transactions"])
        forward = RelationshipPath([(True, r) for _, r in backward])
        path = RelationshipPath(list(forward) + list(backward))
        AggregationFeature(
            IdentityFeature(diamond_es["transactions"].ww["amount"]),
            "transactions",
            Mean,
            relationship_path=path,
        )


def test_init_with_multiple_possible_paths(diamond_es):
    error_text = (
        "There are multiple possible paths to the base dataframe. "
        "You must specify a relationship path."
    )
    with pytest.raises(RuntimeError, match=error_text):
        AggregationFeature(
            IdentityFeature(diamond_es["transactions"].ww["amount"]),
            "regions",
            Mean,
        )

    # Does not raise if path specified.
    path = backward_path(diamond_es, ["regions", "customers", "transactions"])
    AggregationFeature(
        IdentityFeature(diamond_es["transactions"].ww["amount"]),
        "regions",
        Mean,
        relationship_path=path,
    )


def test_init_with_single_possible_path(diamond_es):
    # This uses diamond_es to test that a cycle somewhere in the graph
    # doesn't cause an error.
    feat = AggregationFeature(
        IdentityFeature(diamond_es["transactions"].ww["amount"]),
        "customers",
        Mean,
    )
    expected_path = backward_path(diamond_es, ["customers", "transactions"])
    assert feat.relationship_path == expected_path


def test_init_with_no_path(diamond_es):
    error_text = 'No backward path from "transactions" to "customers" found.'
    with pytest.raises(RuntimeError, match=error_text):
        AggregationFeature(
            IdentityFeature(diamond_es["customers"].ww["name"]),
            "transactions",
            Count,
        )

    error_text = 'No backward path from "transactions" to "transactions" found.'
    with pytest.raises(RuntimeError, match=error_text):
        AggregationFeature(
            IdentityFeature(diamond_es["transactions"].ww["amount"]),
            "transactions",
            Mean,
        )


def test_name_with_multiple_possible_paths(diamond_es):
    path = backward_path(diamond_es, ["regions", "customers", "transactions"])
    feat = AggregationFeature(
        IdentityFeature(diamond_es["transactions"].ww["amount"]),
        "regions",
        Mean,
        relationship_path=path,
    )

    assert feat.get_name() == "MEAN(customers.transactions.amount)"
    assert feat.relationship_path_name() == "customers.transactions"


def test_copy(games_es):
    home_games = next(
        r for r in games_es.relationships if r._child_column_name == "home_team_id"
    )
    path = RelationshipPath([(False, home_games)])
    feat = AggregationFeature(
        IdentityFeature(games_es["games"].ww["home_team_score"]),
        "teams",
        relationship_path=path,
        primitive=Mean,
    )
    copied = feat.copy()
    assert copied.dataframe_name == feat.dataframe_name
    assert copied.base_features == feat.base_features
    assert copied.relationship_path == feat.relationship_path
    assert copied.primitive == feat.primitive


def test_serialization(es):
    value = IdentityFeature(es["log"].ww["value"])
    primitive = Max()
    max1 = AggregationFeature(value, "customers", primitive)

    path = next(es.find_backward_paths("customers", "log"))
    dictionary = {
        "name": max1.get_name(),
        "base_features": [value.unique_name()],
        "relationship_path": [r.to_dictionary() for r in path],
        "primitive": primitive,
        "where": None,
        "use_previous": None,
    }

    assert dictionary == max1.get_arguments()
    deserialized = AggregationFeature.from_dictionary(
        dictionary,
        es,
        {value.unique_name(): value},
        primitive,
    )
    _assert_agg_feats_equal(max1, deserialized)

    is_purchased = IdentityFeature(es["log"].ww["purchased"])
    use_previous = Timedelta(3, "d")
    max2 = AggregationFeature(
        value,
        "customers",
        primitive,
        where=is_purchased,
        use_previous=use_previous,
    )

    dictionary = {
        "name": max2.get_name(),
        "base_features": [value.unique_name()],
        "relationship_path": [r.to_dictionary() for r in path],
        "primitive": primitive,
        "where": is_purchased.unique_name(),
        "use_previous": use_previous.get_arguments(),
    }

    assert dictionary == max2.get_arguments()
    dependencies = {
        value.unique_name(): value,
        is_purchased.unique_name(): is_purchased,
    }
    deserialized = AggregationFeature.from_dictionary(
        dictionary,
        es,
        dependencies,
        primitive,
    )
    _assert_agg_feats_equal(max2, deserialized)


def test_time_since_last(es):
    f = Feature(
        es["log"].ww["datetime"],
        parent_dataframe_name="customers",
        primitive=TimeSinceLast,
    )
    fm = calculate_feature_matrix(
        [f],
        entityset=es,
        instance_ids=[0, 1, 2],
        cutoff_time=datetime(2015, 6, 8),
    )

    correct = [131376000.0, 131289534.0, 131287797.0]
    # note: must round to nearest second
    assert all(fm[f.get_name()].round().values == correct)


def test_time_since_first(es):
    f = Feature(
        es["log"].ww["datetime"],
        parent_dataframe_name="customers",
        primitive=TimeSinceFirst,
    )
    fm = calculate_feature_matrix(
        [f],
        entityset=es,
        instance_ids=[0, 1, 2],
        cutoff_time=datetime(2015, 6, 8),
    )

    correct = [131376600.0, 131289600.0, 131287800.0]
    # note: must round to nearest second
    assert all(fm[f.get_name()].round().values == correct)


def test_median(es):
    f = Feature(
        es["log"].ww["value_many_nans"],
        parent_dataframe_name="customers",
        primitive=Median,
    )
    fm = calculate_feature_matrix(
        [f],
        entityset=es,
        instance_ids=[0, 1, 2],
        cutoff_time=datetime(2015, 6, 8),
    )

    correct = [1, 3, np.nan]
    np.testing.assert_equal(fm[f.get_name()].values, correct)


def test_agg_same_method_name(es):
    """
    Pandas relies on the function name when calculating aggregations. This means that if two
    primitives with the same function name are applied to the same column, pandas
    can't differentiate them. We have a workaround for this based on the name property,
    which we test here.
    """

    # test with normally defined functions
    class Sum(AggregationPrimitive):
        name = "sum"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def get_function(self):
            def custom_primitive(x):
                return x.sum()

            return custom_primitive

    class Max(AggregationPrimitive):
        name = "max"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def get_function(self):
            def custom_primitive(x):
                return x.max()

            return custom_primitive

    f_sum = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Sum,
    )
    f_max = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Max,
    )

    fm = calculate_feature_matrix([f_sum, f_max], entityset=es)
    assert fm.columns.tolist() == [f_sum.get_name(), f_max.get_name()]

    # test with lambdas
    class Sum(AggregationPrimitive):
        name = "sum"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def get_function(self):
            return lambda x: x.sum()

    class Max(AggregationPrimitive):
        name = "max"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def get_function(self):
            return lambda x: x.max()

    f_sum = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Sum,
    )
    f_max = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Max,
    )
    fm = calculate_feature_matrix([f_sum, f_max], entityset=es)
    assert fm.columns.tolist() == [f_sum.get_name(), f_max.get_name()]


def test_time_since_last_custom(es):
    class TimeSinceLast(AggregationPrimitive):
        name = "time_since_last"
        input_types = [
            ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        uses_calc_time = True

        def get_function(self):
            def time_since_last(values, time):
                time_since = time - values.iloc[0]
                return time_since.total_seconds()

            return time_since_last

    f = Feature(
        es["log"].ww["datetime"],
        parent_dataframe_name="customers",
        primitive=TimeSinceLast,
    )
    fm = calculate_feature_matrix(
        [f],
        entityset=es,
        instance_ids=[0, 1, 2],
        cutoff_time=datetime(2015, 6, 8),
    )

    correct = [131376600, 131289600, 131287800]
    # note: must round to nearest second
    assert all(fm[f.get_name()].round().values == correct)


def test_custom_primitive_multiple_inputs(es):
    class MeanSunday(AggregationPrimitive):
        name = "mean_sunday"
        input_types = [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(logical_type=Datetime),
        ]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def get_function(self):
            def mean_sunday(numeric, datetime):
                """
                Finds the mean of non-null values of a feature that occurred on Sundays
                """
                days = pd.DatetimeIndex(datetime).weekday.values
                df = pd.DataFrame({"numeric": numeric, "time": days})
                return df[df["time"] == 6]["numeric"].mean()

            return mean_sunday

    fm, features = dfs(
        entityset=es,
        target_dataframe_name="sessions",
        agg_primitives=[MeanSunday],
        trans_primitives=[],
    )
    mean_sunday_value = pd.Series([None, None, None, 2.5, 7, None])
    iterator = zip(fm["MEAN_SUNDAY(log.value, datetime)"], mean_sunday_value)
    for x, y in iterator:
        assert (pd.isnull(x) and pd.isnull(y)) or (x == y)

    es.add_interesting_values()
    mean_sunday_value_priority_0 = pd.Series([None, None, None, 2.5, 0, None])
    fm, features = dfs(
        entityset=es,
        target_dataframe_name="sessions",
        agg_primitives=[MeanSunday],
        trans_primitives=[],
        where_primitives=[MeanSunday],
    )
    where_feat = "MEAN_SUNDAY(log.value, datetime WHERE priority_level = 0)"
    for x, y in zip(fm[where_feat], mean_sunday_value_priority_0):
        assert (pd.isnull(x) and pd.isnull(y)) or (x == y)


def test_custom_primitive_default_kwargs(es):
    class SumNTimes(AggregationPrimitive):
        name = "sum_n_times"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def __init__(self, n=1):
            self.n = n

    sum_n_1_n = 1
    sum_n_1_base_f = Feature(es["log"].ww["value"])
    sum_n_1 = Feature(
        [sum_n_1_base_f],
        parent_dataframe_name="sessions",
        primitive=SumNTimes(n=sum_n_1_n),
    )
    sum_n_2_n = 2
    sum_n_2_base_f = Feature(es["log"].ww["value_2"])
    sum_n_2 = Feature(
        [sum_n_2_base_f],
        parent_dataframe_name="sessions",
        primitive=SumNTimes(n=sum_n_2_n),
    )
    assert sum_n_1_base_f == sum_n_1.base_features[0]
    assert sum_n_1_n == sum_n_1.primitive.n
    assert sum_n_2_base_f == sum_n_2.base_features[0]
    assert sum_n_2_n == sum_n_2.primitive.n


def test_makes_numtrue(es):
    dfs = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[NumTrue],
        trans_primitives=[],
    )
    features = dfs.build_features()
    assert feature_with_name(features, "customers.NUM_TRUE(log.purchased)")
    assert feature_with_name(features, "NUM_TRUE(log.purchased)")


def test_make_three_most_common(es):
    class NMostCommoner(AggregationPrimitive):
        name = "pd_top3"
        input_types = ([ColumnSchema(semantic_tags={"category"})],)
        return_type = None
        number_output_features = 3

        def get_function(self):
            def pd_top3(x):
                counts = x.value_counts()
                counts = counts[counts > 0]
                array = np.array(counts[:3].index)
                if len(array) < 3:
                    filler = np.full(3 - len(array), np.nan)
                    array = np.append(array, filler)
                return array

            return pd_top3

    fm, features = dfs(
        entityset=es,
        target_dataframe_name="customers",
        instance_ids=[0, 1, 2],
        agg_primitives=[NMostCommoner],
        trans_primitives=[],
    )

    df = fm[["PD_TOP3(log.product_id)[%s]" % i for i in range(3)]]

    assert set(df.iloc[0].values[:2]) == set(
        ["coke zero", "toothpaste"],
    )  # coke zero and toothpaste have same number of occurrences
    assert df.iloc[0].values[2] in [
        "car",
        "brown bag",
    ]  # so just check that the top two match

    assert (
        df.iloc[1]
        .reset_index(drop=True)
        .equals(pd.Series(["coke zero", "Haribo sugar-free gummy bears", np.nan]))
    )
    assert (
        df.iloc[2]
        .reset_index(drop=True)
        .equals(pd.Series(["taco clock", np.nan, np.nan]))
    )


def test_stacking_multi(es):
    threecommon = NMostCommon(3)
    tc = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="sessions",
        primitive=threecommon,
    )

    stacked = []
    for i in range(3):
        stacked.append(
            Feature(tc[i], parent_dataframe_name="customers", primitive=NumUnique),
        )

    fm = calculate_feature_matrix(stacked, entityset=es, instance_ids=[0, 1, 2])

    correct_vals = [[3, 2, 1], [2, 1, 0], [0, 0, 0]]
    correct_vals1 = [[3, 1, 1], [2, 1, 0], [0, 0, 0]]
    # either of the above can be correct, and the outcome depends on the sorting of
    # two values in the initial n most common function, which changes arbitrarily.

    for i in range(3):
        f = "NUM_UNIQUE(sessions.N_MOST_COMMON(log.product_id)[%d])" % i
        cols = fm.columns
        assert f in cols
        assert (
            fm[cols[i]].tolist() == correct_vals[i]
            or fm[cols[i]].tolist() == correct_vals1[i]
        )


def test_use_previous_pd_dateoffset(es):
    total_events_pd = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        use_previous=pd.DateOffset(hours=47, minutes=60),
        primitive=Count,
    )

    feature_matrix = calculate_feature_matrix(
        [total_events_pd],
        es,
        cutoff_time=pd.Timestamp("2011-04-11 10:31:30"),
        instance_ids=[0, 1, 2],
    )
    col_name = list(feature_matrix.head().keys())[0]
    assert (feature_matrix[col_name] == [1, 5, 2]).all()


def _assert_agg_feats_equal(f1, f2):
    assert f1.unique_name() == f2.unique_name()
    assert f1.child_dataframe_name == f2.child_dataframe_name
    assert f1.parent_dataframe_name == f2.parent_dataframe_name
    assert f1.relationship_path == f2.relationship_path
    assert f1.use_previous == f2.use_previous


def test_override_multi_feature_names(es):
    def gen_custom_names(
        primitive,
        base_feature_names,
        relationship_path_name,
        parent_dataframe_name,
        where_str,
        use_prev_str,
    ):
        base_string = "Custom_%s({}.{})".format(
            parent_dataframe_name,
            base_feature_names,
        )
        return [base_string % i for i in range(primitive.number_output_features)]

    class NMostCommoner(AggregationPrimitive):
        name = "pd_top3"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"category"})
        number_output_features = 3

        def generate_names(
            self,
            base_feature_names,
            relationship_path_name,
            parent_dataframe_name,
            where_str,
            use_prev_str,
        ):
            return gen_custom_names(
                self,
                base_feature_names,
                relationship_path_name,
                parent_dataframe_name,
                where_str,
                use_prev_str,
            )

    fm, features = dfs(
        entityset=es,
        target_dataframe_name="products",
        instance_ids=[0, 1, 2],
        agg_primitives=[NMostCommoner],
        trans_primitives=[],
    )

    expected_names = []
    base_names = [["value"], ["value_2"], ["value_many_nans"]]
    for name in base_names:
        expected_names += gen_custom_names(
            NMostCommoner,
            name,
            None,
            "products",
            None,
            None,
        )

    for name in expected_names:
        assert name in fm.columns


================================================
FILE: featuretools/tests/primitive_tests/test_all_primitive_docstrings.py
================================================
from featuretools.primitives import get_aggregation_primitives, get_transform_primitives


def docstring_is_uniform(primitive):
    docstring = primitive.__doc__
    valid_verbs = [
        "Calculates",
        "Determines",
        "Transforms",
        "Computes",
        "Counts",
        "Negates",
        "Adds",
        "Subtracts",
        "Multiplies",
        "Divides",
        "Performs",
        "Returns",
        "Shifts",
        "Extracts",
        "Applies",
    ]
    return any(docstring.startswith(s) for s in valid_verbs)


def test_transform_primitive_docstrings():
    for primitive in get_transform_primitives().values():
        assert docstring_is_uniform(primitive)


def test_aggregation_primitive_docstrings():
    for primitive in get_aggregation_primitives().values():
        assert docstring_is_uniform(primitive)


================================================
FILE: featuretools/tests/primitive_tests/test_direct_features.py
================================================
import numpy as np
import pandas as pd
import pytest
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools.computational_backends.feature_set import FeatureSet
from featuretools.computational_backends.feature_set_calculator import (
    FeatureSetCalculator,
)
from featuretools.feature_base import DirectFeature, Feature, IdentityFeature
from featuretools.primitives import (
    AggregationPrimitive,
    Day,
    Hour,
    Minute,
    Month,
    NMostCommon,
    Second,
    TransformPrimitive,
    Year,
)
from featuretools.primitives.utils import PrimitivesDeserializer
from featuretools.synthesis import dfs


def test_direct_from_identity(es):
    device = Feature(es["sessions"].ww["device_type"])
    d = DirectFeature(base_feature=device, child_dataframe_name="log")

    feature_set = FeatureSet([d])
    calculator = FeatureSetCalculator(es, feature_set=feature_set, time_last=None)
    df = calculator.run(np.array([0, 5]))
    v = df[d.get_name()].tolist()
    expected = [0, 1]
    assert v == expected


def test_direct_from_column(es):
    # should be same behavior as test_direct_from_identity
    device = Feature(es["sessions"].ww["device_type"])
    d = DirectFeature(base_feature=device, child_dataframe_name="log")

    feature_set = FeatureSet([d])
    calculator = FeatureSetCalculator(es, feature_set=feature_set, time_last=None)
    df = calculator.run(np.array([0, 5]))
    v = df[d.get_name()].tolist()
    expected = [0, 1]
    assert v == expected


def test_direct_rename_multioutput(es):
    n_common = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    feat = DirectFeature(n_common, "sessions")
    copy_feat = feat.rename("session_test")
    assert feat.unique_name() != copy_feat.unique_name()
    assert feat.get_name() != copy_feat.get_name()
    assert (
        feat.base_features[0].generate_name()
        == copy_feat.base_features[0].generate_name()
    )
    assert feat.dataframe_name == copy_feat.dataframe_name


def test_direct_rename(es):
    # renaming should change the names but not the base feature or dataframe
    feat = DirectFeature(
        base_feature=IdentityFeature(es["sessions"].ww["device_type"]),
        child_dataframe_name="log",
    )
    copy_feat = feat.rename("session_test")
    assert feat.unique_name() != copy_feat.unique_name()
    assert feat.get_name() != copy_feat.get_name()
    assert (
        feat.base_features[0].generate_name()
        == copy_feat.base_features[0].generate_name()
    )
    assert feat.dataframe_name == copy_feat.dataframe_name


def test_direct_copy(games_es):
    home_team = next(
        r for r in games_es.relationships if r._child_column_name == "home_team_id"
    )
    feat = DirectFeature(
        IdentityFeature(games_es["teams"].ww["name"]),
        "games",
        relationship=home_team,
    )
    copied = feat.copy()
    assert copied.dataframe_name == feat.dataframe_name
    assert copied.base_features == feat.base_features
    assert copied.relationship_path == feat.relationship_path


def test_direct_of_multi_output_transform_feat(es):
    class TestTime(TransformPrimitive):
        name = "test_time"
        input_types = [ColumnSchema(logical_type=Datetime)]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 6

        def get_function(self):
            def test_f(x):
                times = pd.Series(x)
                units = ["year", "month", "day", "hour", "minute", "second"]
                return [times.apply(lambda x: getattr(x, unit)) for unit in units]

            return test_f

    base_feature = IdentityFeature(es["customers"].ww["signup_date"])
    join_time_split = Feature(base_feature, primitive=TestTime)
    alt_features = [
        Feature(base_feature, primitive=Year),
        Feature(base_feature, primitive=Month),
        Feature(base_feature, primitive=Day),
        Feature(base_feature, primitive=Hour),
        Feature(base_feature, primitive=Minute),
        Feature(base_feature, primitive=Second),
    ]
    fm, fl = dfs(
        entityset=es,
        target_dataframe_name="sessions",
        trans_primitives=[TestTime, Year, Month, Day, Hour, Minute, Second],
    )

    # Get column names for the multi-output feature and the single-output features
    subnames = DirectFeature(join_time_split, "sessions").get_feature_names()
    altnames = [DirectFeature(f, "sessions").get_name() for f in alt_features]

    # Check values are equal between the two sets of columns
    for col1, col2 in zip(subnames, altnames):
        assert (fm[col1] == fm[col2]).all()


def test_direct_features_of_multi_output_agg_primitives(es):
    class ThreeMostCommonCat(AggregationPrimitive):
        name = "n_most_common_categorical"
        input_types = [ColumnSchema(semantic_tags={"category"})]
        return_type = ColumnSchema(semantic_tags={"category"})
        number_output_features = 3

        def get_function(self):
            def pd_top3(x):
                counts = x.value_counts()
                counts = counts[counts > 0]
                array = np.array(counts.index[:3])
                if len(array) < 3:
                    filler = np.full(3 - len(array), np.nan)
                    array = np.append(array, filler)
                return array

            return pd_top3

    fm, fl = dfs(
        entityset=es,
        target_dataframe_name="log",
        agg_primitives=[ThreeMostCommonCat],
        trans_primitives=[],
        max_depth=3,
    )

    has_nmost_as_base = []
    for feature in fl:
        is_base = False
        if len(feature.base_features) > 0 and isinstance(
            feature.base_features[0].primitive,
            ThreeMostCommonCat,
        ):
            is_base = True
        has_nmost_as_base.append(is_base)
    assert any(has_nmost_as_base)

    true_result_rows = []
    session_data = {
        0: ["coke zero", "car", np.nan],
        1: ["toothpaste", "brown bag", np.nan],
        2: ["brown bag", np.nan, np.nan],
        3: set(["Haribo sugar-free gummy bears", "coke zero", np.nan]),
        4: ["coke zero", np.nan, np.nan],
        5: ["taco clock", np.nan, np.nan],
    }
    for i, count in enumerate([5, 4, 1, 2, 3, 2]):
        while count > 0:
            true_result_rows.append(session_data[i])
            count -= 1

    tempname = "sessions.N_MOST_COMMON_CATEGORICAL(log.product_id)[%s]"
    for i, row in enumerate(true_result_rows):
        for j in range(3):
            value = fm[tempname % (j)][i]
            if isinstance(row, set):
                assert pd.isnull(value) or value in row
            else:
                assert (pd.isnull(value) and pd.isnull(row[j])) or value == row[j]


def test_direct_with_invalid_init_args(diamond_es):
    customer_to_region = diamond_es.get_forward_relationships("customers")[0]
    error_text = "child_dataframe must be the relationship child dataframe"
    with pytest.raises(AssertionError, match=error_text):
        DirectFeature(
            IdentityFeature(diamond_es["regions"].ww["name"]),
            "stores",
            relationship=customer_to_region,
        )

    transaction_relationships = diamond_es.get_forward_relationships("transactions")
    transaction_to_store = next(
        r for r in transaction_relationships if r.parent_dataframe.ww.name == "stores"
    )
    error_text = "Base feature must be defined on the relationship parent dataframe"
    with pytest.raises(AssertionError, match=error_text):
        DirectFeature(
            IdentityFeature(diamond_es["regions"].ww["name"]),
            "transactions",
            relationship=transaction_to_store,
        )


def test_direct_with_multiple_possible_paths(games_es):
    error_text = (
        "There are multiple relationships to the base dataframe. "
        "You must specify a relationship."
    )
    with pytest.raises(RuntimeError, match=error_text):
        DirectFeature(IdentityFeature(games_es["teams"].ww["name"]), "games")

    # Does not raise if path specified.
    relationship = next(
        r
        for r in games_es.get_forward_relationships("games")
        if r._child_column_name == "home_team_id"
    )
    feat = DirectFeature(
        IdentityFeature(games_es["teams"].ww["name"]),
        "games",
        relationship=relationship,
    )
    assert feat.relationship_path_name() == "teams[home_team_id]"
    assert feat.get_name() == "teams[home_team_id].name"


def test_direct_with_single_possible_path(es):
    feat = DirectFeature(IdentityFeature(es["customers"].ww["age"]), "sessions")
    assert feat.relationship_path_name() == "customers"
    assert feat.get_name() == "customers.age"


def test_direct_with_no_path(diamond_es):
    error_text = 'No relationship from "regions" to "customers" found.'
    with pytest.raises(RuntimeError, match=error_text):
        DirectFeature(IdentityFeature(diamond_es["customers"].ww["name"]), "regions")

    error_text = 'No relationship from "customers" to "customers" found.'
    with pytest.raises(RuntimeError, match=error_text):
        DirectFeature(IdentityFeature(diamond_es["customers"].ww["name"]), "customers")


def test_serialization(es):
    value = IdentityFeature(es["products"].ww["rating"])
    direct = DirectFeature(value, "log")

    log_to_products = next(
        r
        for r in es.get_forward_relationships("log")
        if r.parent_dataframe.ww.name == "products"
    )
    dictionary = {
        "name": direct.get_name(),
        "base_feature": value.unique_name(),
        "relationship": log_to_products.to_dictionary(),
    }

    assert dictionary == direct.get_arguments()
    assert direct == DirectFeature.from_dictionary(
        dictionary,
        es,
        {value.unique_name(): value},
        PrimitivesDeserializer(),
    )


================================================
FILE: featuretools/tests/primitive_tests/test_feature_base.py
================================================
import os.path
import re

import pytest
from pympler.asizeof import asizeof
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime, Integer

from featuretools import Feature, config, feature_base
from featuretools.feature_base import IdentityFeature
from featuretools.primitives import (
    Count,
    Diff,
    Last,
    Mode,
    Negate,
    NMostCommon,
    NumUnique,
    Sum,
    TransformPrimitive,
)
from featuretools.synthesis.deep_feature_synthesis import can_stack_primitive_on_inputs
from featuretools.tests.testing_utils import check_rename


def test_copy_features_does_not_copy_entityset(es):
    agg = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        primitive=Sum,
    )
    agg_where = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        where=IdentityFeature(es["log"].ww["value"]) == 2,
        primitive=Sum,
    )
    agg_use_previous = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        use_previous="4 days",
        primitive=Sum,
    )
    agg_use_previous_where = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        where=IdentityFeature(es["log"].ww["value"]) == 2,
        use_previous="4 days",
        primitive=Sum,
    )
    features = [agg, agg_where, agg_use_previous, agg_use_previous_where]
    in_memory_size = asizeof(locals())
    copied = [f.copy() for f in features]
    new_in_memory_size = asizeof(locals())
    assert new_in_memory_size < 2 * in_memory_size


def test_get_dependencies(es):
    f = Feature(es["log"].ww["value"])
    agg1 = Feature(f, parent_dataframe_name="sessions", primitive=Sum)
    agg2 = Feature(agg1, parent_dataframe_name="customers", primitive=Sum)
    d1 = Feature(agg2, "sessions")
    shallow = d1.get_dependencies(deep=False, ignored=None)
    deep = d1.get_dependencies(deep=True, ignored=None)
    ignored = set([agg1.unique_name()])
    deep_ignored = d1.get_dependencies(deep=True, ignored=ignored)
    assert [s.unique_name() for s in shallow] == [agg2.unique_name()]
    assert [d.unique_name() for d in deep] == [
        agg2.unique_name(),
        agg1.unique_name(),
        f.unique_name(),
    ]
    assert [d.unique_name() for d in deep_ignored] == [agg2.unique_name()]


def test_get_depth(es):
    f = Feature(es["log"].ww["value"])
    g = Feature(es["log"].ww["value"])
    agg1 = Feature(f, parent_dataframe_name="sessions", primitive=Last)
    agg2 = Feature(agg1, parent_dataframe_name="customers", primitive=Last)
    d1 = Feature(agg2, "sessions")
    d2 = Feature(d1, "log")
    assert d2.get_depth() == 4
    # Make sure this works if we pass in two of the same
    # feature. This came up when a user supplied duplicates
    # in the seed_features of DFS.
    assert d2.get_depth(stop_at=[f, g]) == 4
    assert d2.get_depth(stop_at=[f, g, agg1]) == 3
    assert d2.get_depth(stop_at=[f, g, agg2]) == 2
    assert d2.get_depth(stop_at=[f, g, d1]) == 1
    assert d2.get_depth(stop_at=[f, g, d2]) == 0


def test_squared(es):
    feature = Feature(es["log"].ww["value"])
    squared = feature * feature
    assert len(squared.base_features) == 2
    assert (
        squared.base_features[0].unique_name() == squared.base_features[1].unique_name()
    )


def test_return_type_inference(es):
    mode = Feature(
        es["log"].ww["priority_level"],
        parent_dataframe_name="customers",
        primitive=Mode,
    )
    assert (
        mode.column_schema
        == IdentityFeature(es["log"].ww["priority_level"]).column_schema
    )


def test_return_type_inference_direct_feature(es):
    mode = Feature(
        es["log"].ww["priority_level"],
        parent_dataframe_name="customers",
        primitive=Mode,
    )
    mode_session = Feature(mode, "sessions")
    assert (
        mode_session.column_schema
        == IdentityFeature(es["log"].ww["priority_level"]).column_schema
    )


def test_return_type_inference_index(es):
    last = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Last,
    )
    assert "index" not in last.column_schema.semantic_tags
    assert isinstance(last.column_schema.logical_type, Integer)


def test_return_type_inference_datetime_time_index(es):
    last = Feature(
        es["log"].ww["datetime"],
        parent_dataframe_name="customers",
        primitive=Last,
    )
    assert isinstance(last.column_schema.logical_type, Datetime)


def test_return_type_inference_numeric_time_index(int_es):
    last = Feature(
        int_es["log"].ww["datetime"],
        parent_dataframe_name="customers",
        primitive=Last,
    )
    assert "numeric" in last.column_schema.semantic_tags


def test_return_type_inference_id(es):
    # direct features should keep foreign key tag
    direct_id_feature = Feature(es["sessions"].ww["customer_id"], "log")
    assert "foreign_key" in direct_id_feature.column_schema.semantic_tags

    # aggregations of foreign key types should get converted
    last_feat = Feature(
        es["log"].ww["session_id"],
        parent_dataframe_name="customers",
        primitive=Last,
    )
    assert "foreign_key" not in last_feat.column_schema.semantic_tags
    assert isinstance(last_feat.column_schema.logical_type, Integer)

    # also test direct feature of aggregation
    last_direct = Feature(last_feat, "sessions")
    assert "foreign_key" not in last_direct.column_schema.semantic_tags
    assert isinstance(last_direct.column_schema.logical_type, Integer)


def test_set_data_path(es):
    key = "primitive_data_folder"

    # Don't change orig_path
    orig_path = config.get(key)
    new_path = "/example/new/directory"
    filename = "test.csv"

    # Test that default path works
    sum_prim = Sum()
    assert sum_prim.get_filepath(filename) == os.path.join(orig_path, filename)

    # Test that new path works
    config.set({key: new_path})
    assert sum_prim.get_filepath(filename) == os.path.join(new_path, filename)

    # Test that new path with trailing / works
    new_path += "/"
    config.set({key: new_path})
    assert sum_prim.get_filepath(filename) == os.path.join(new_path, filename)

    # Test that the path is correct on newly defined feature
    sum_prim2 = Sum()
    assert sum_prim2.get_filepath(filename) == os.path.join(new_path, filename)

    # Ensure path was reset
    config.set({key: orig_path})
    assert config.get(key) == orig_path


def test_to_dictionary_direct(es):
    actual = Feature(
        IdentityFeature(es["sessions"].ww["customer_id"]),
        "log",
    ).to_dictionary()

    expected = {
        "type": "DirectFeature",
        "dependencies": ["sessions: customer_id"],
        "arguments": {
            "name": "sessions.customer_id",
            "base_feature": "sessions: customer_id",
            "relationship": {
                "parent_dataframe_name": "sessions",
                "child_dataframe_name": "log",
                "parent_column_name": "id",
                "child_column_name": "session_id",
            },
        },
    }

    assert expected == actual


def test_to_dictionary_identity(es):
    actual = Feature(es["sessions"].ww["customer_id"]).to_dictionary()

    expected = {
        "type": "IdentityFeature",
        "dependencies": [],
        "arguments": {
            "name": "customer_id",
            "column_name": "customer_id",
            "dataframe_name": "sessions",
        },
    }

    assert expected == actual


def test_to_dictionary_agg(es):
    primitive = Sum()
    actual = Feature(
        es["customers"].ww["age"],
        primitive=primitive,
        parent_dataframe_name="cohorts",
    ).to_dictionary()

    expected = {
        "type": "AggregationFeature",
        "dependencies": ["customers: age"],
        "arguments": {
            "name": "SUM(customers.age)",
            "base_features": ["customers: age"],
            "relationship_path": [
                {
                    "parent_dataframe_name": "cohorts",
                    "child_dataframe_name": "customers",
                    "parent_column_name": "cohort",
                    "child_column_name": "cohort",
                },
            ],
            "primitive": primitive,
            "where": None,
            "use_previous": None,
        },
    }

    assert expected == actual


def test_to_dictionary_where(es):
    primitive = Sum()
    actual = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        where=IdentityFeature(es["log"].ww["value"]) == 2,
        primitive=primitive,
    ).to_dictionary()

    expected = {
        "type": "AggregationFeature",
        "dependencies": ["log: value", "log: value = 2"],
        "arguments": {
            "name": "SUM(log.value WHERE value = 2)",
            "base_features": ["log: value"],
            "relationship_path": [
                {
                    "parent_dataframe_name": "sessions",
                    "child_dataframe_name": "log",
                    "parent_column_name": "id",
                    "child_column_name": "session_id",
                },
            ],
            "primitive": primitive,
            "where": "log: value = 2",
            "use_previous": None,
        },
    }

    assert expected == actual


def test_to_dictionary_trans(es):
    primitive = Negate()
    trans_feature = Feature(es["customers"].ww["age"], primitive=primitive)

    expected = {
        "type": "TransformFeature",
        "dependencies": ["customers: age"],
        "arguments": {
            "name": "-(age)",
            "base_features": ["customers: age"],
            "primitive": primitive,
        },
    }

    assert expected == trans_feature.to_dictionary()


def test_to_dictionary_groupby_trans(es):
    primitive = Negate()
    id_feat = Feature(es["log"].ww["product_id"])
    groupby_feature = Feature(
        es["log"].ww["value"],
        primitive=primitive,
        groupby=id_feat,
    )

    expected = {
        "type": "GroupByTransformFeature",
        "dependencies": ["log: value", "log: product_id"],
        "arguments": {
            "name": "-(value) by product_id",
            "base_features": ["log: value"],
            "primitive": primitive,
            "groupby": "log: product_id",
        },
    }

    assert expected == groupby_feature.to_dictionary()


def test_to_dictionary_multi_slice(es):
    slice_feature = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )[0]

    expected = {
        "type": "FeatureOutputSlice",
        "dependencies": ["customers: N_MOST_COMMON(log.product_id, n=2)"],
        "arguments": {
            "name": "N_MOST_COMMON(log.product_id, n=2)[0]",
            "base_feature": "customers: N_MOST_COMMON(log.product_id, n=2)",
            "n": 0,
        },
    }

    assert expected == slice_feature.to_dictionary()


def test_multi_output_base_error_agg(es):
    three_common = NMostCommon(3)
    tc = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="sessions",
        primitive=three_common,
    )
    error_text = "Cannot stack on whole multi-output feature."
    with pytest.raises(ValueError, match=error_text):
        Feature(tc, parent_dataframe_name="customers", primitive=NumUnique)


def test_multi_output_base_error_trans(es):
    class TestTime(TransformPrimitive):
        name = "test_time"
        input_types = [ColumnSchema(logical_type=Datetime)]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 6

    tc = Feature(es["customers"].ww["birthday"], primitive=TestTime)

    error_text = "Cannot stack on whole multi-output feature."
    with pytest.raises(ValueError, match=error_text):
        Feature(tc, primitive=Diff)


def test_multi_output_attributes(es):
    tc = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="sessions",
        primitive=NMostCommon,
    )

    assert tc.generate_name() == "N_MOST_COMMON(log.product_id)"
    assert tc.number_output_features == 3
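    # Compare reprs: features overload ==, so comparing a feature to a string
    # does not produce a plain boolean.
    assert [repr(f) for f in tc.base_features] == ["<Feature: product_id>"]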

    assert tc[0].generate_name() == "N_MOST_COMMON(log.product_id)[0]"
    assert tc[0].number_output_features == 1
    assert tc[0].base_features == [tc]
    assert tc.relationship_path == tc[0].relationship_path


def test_multi_output_index_error(es):
    error_text = "can only access slice of multi-output feature"
    three_common = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="sessions",
        primitive=NMostCommon,
    )

    with pytest.raises(AssertionError, match=error_text):
        single = Feature(
            es["log"].ww["product_id"],
            parent_dataframe_name="sessions",
            primitive=NumUnique,
        )
        single[0]

    error_text = "Cannot get item from slice of multi output feature"
    with pytest.raises(ValueError, match=error_text):
        three_common[0][0]

    error_text = "index is higher than the number of outputs"
    with pytest.raises(AssertionError, match=error_text):
        three_common[10]


def test_rename(es):
    feat = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    new_name = "session_test"
    new_names = ["session_test"]
    check_rename(feat, new_name, new_names)


def test_rename_multioutput(es):
    feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    new_name = "session_test"
    new_names = ["session_test[0]", "session_test[1]"]
    check_rename(feat, new_name, new_names)


def test_rename_featureoutputslice(es):
    multi_output_feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    feat = feature_base.FeatureOutputSlice(multi_output_feat, 0)
    new_name = "session_test"
    new_names = ["session_test"]
    check_rename(feat, new_name, new_names)


def test_set_feature_names_wrong_number_of_names(es):
    feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    new_names = ["col1"]
    error_msg = re.escape(
        "Number of names provided must match the number of output features: 1 name(s) provided, 2 expected.",
    )
    with pytest.raises(ValueError, match=error_msg):
        feat.set_feature_names(new_names)


def test_set_feature_names_not_unique(es):
    feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    new_names = ["col1", "col1"]
    error_msg = "Provided output feature names must be unique."
    with pytest.raises(ValueError, match=error_msg):
        feat.set_feature_names(new_names)


def test_set_feature_names_error_on_single_output_feature(es):
    feat = Feature(es["sessions"].ww["device_name"], "log")
    new_names = ["sessions_device"]
    error_msg = "The set_feature_names can only be used on features that have more than one output column."
    with pytest.raises(ValueError, match=error_msg):
        feat.set_feature_names(new_names)


def test_set_feature_names_transform_feature(es):
    class MultiCumulative(TransformPrimitive):
        name = "multi_cum_sum"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 3

    feat = Feature(es["log"].ww["value"], primitive=MultiCumulative)
    new_names = ["cumulative_sum", "cumulative_max", "cumulative_min"]
    feat.set_feature_names(new_names)
    assert feat.get_feature_names() == new_names


def test_set_feature_names_aggregation_feature(es):
    feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    new_names = ["agg_col_1", "second_agg_col"]
    feat.set_feature_names(new_names)
    assert feat.get_feature_names() == new_names


def test_renaming_resets_feature_output_names_to_default(es):
    feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    new_names = ["renamed1", "renamed2"]
    feat.set_feature_names(new_names)
    assert feat.get_feature_names() == new_names

    feat = feat.rename("new_feature_name")
    assert feat.get_feature_names() == ["new_feature_name[0]", "new_feature_name[1]"]


def test_base_of_and_stack_on_heuristic(es, test_aggregation_primitive):
    child = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    test_aggregation_primitive.stack_on = []
    child.primitive.base_of = []
    assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on = []
    child.primitive.base_of = None
    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on = []
    child.primitive.base_of = [test_aggregation_primitive]
    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on = None
    child.primitive.base_of = []
    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on = None
    child.primitive.base_of = None
    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on = None
    child.primitive.base_of = [test_aggregation_primitive]
    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on = [type(child.primitive)]
    child.primitive.base_of = []
    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on = [type(child.primitive)]
    child.primitive.base_of = None
    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on = [type(child.primitive)]
    child.primitive.base_of = [test_aggregation_primitive]
    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on = None
    child.primitive.base_of = None
    child.primitive.base_of_exclude = [test_aggregation_primitive]
    assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    test_aggregation_primitive.stack_on_exclude = [Count]
    assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])

    child.primitive.number_output_features = 2
    test_aggregation_primitive.stack_on_exclude = []
    test_aggregation_primitive.stack_on = []
    child.primitive.base_of = []
    assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])


def test_stack_on_self(es, test_transform_primitive):
    # test stacks on self
    child = Feature(
        es["log"].ww["value"],
        primitive=test_transform_primitive,
    )
    test_transform_primitive.stack_on = []
    child.primitive.base_of = []
    test_transform_primitive.stack_on_self = False
    child.primitive.stack_on_self = False
    assert not can_stack_primitive_on_inputs(test_transform_primitive(), [child])

    test_transform_primitive.stack_on_self = True
    assert can_stack_primitive_on_inputs(test_transform_primitive(), [child])

    test_transform_primitive.stack_on = None
    test_transform_primitive.stack_on_self = False
    assert not can_stack_primitive_on_inputs(test_transform_primitive(), [child])


================================================
FILE: featuretools/tests/primitive_tests/test_feature_descriptions.py
================================================
import json
import os

import pytest
from woodwork.column_schema import ColumnSchema

from featuretools import describe_feature
from featuretools.feature_base import (
    AggregationFeature,
    DirectFeature,
    GroupByTransformFeature,
    IdentityFeature,
    TransformFeature,
)
from featuretools.primitives import (
    Absolute,
    AggregationPrimitive,
    CumMean,
    EqualScalar,
    Mean,
    Mode,
    NMostCommon,
    NumUnique,
    PercentTrue,
    Sum,
    TransformPrimitive,
)


def test_identity_description(es):
    feature = IdentityFeature(es["log"].ww["session_id"])
    description = 'The "session_id".'

    assert describe_feature(feature) == description


def test_direct_description(es):
    feature = DirectFeature(
        IdentityFeature(es["customers"].ww["loves_ice_cream"]),
        "sessions",
    )
    description = (
        'The "loves_ice_cream" for the instance of "customers" associated '
        'with this instance of "sessions".'
    )
    assert describe_feature(feature) == description

    deep_direct = DirectFeature(feature, "log")
    deep_description = (
        'The "loves_ice_cream" for the instance of "customers" '
        'associated with the instance of "sessions" associated with '
        'this instance of "log".'
    )
    assert describe_feature(deep_direct) == deep_description

    agg = AggregationFeature(
        IdentityFeature(es["log"].ww["purchased"]),
        "sessions",
        PercentTrue,
    )
    complicated_direct = DirectFeature(agg, "log")
    agg_on_direct = AggregationFeature(complicated_direct, "products", Mean)

    complicated_description = (
        "The average of the percentage of true values in "
        'the "purchased" of all instances of "log" for each "id" in "sessions" for '
        'the instance of "sessions" associated with this instance of "log" of all '
        'instances of "log" for each "id" in "products".'
    )
    assert describe_feature(agg_on_direct) == complicated_description


def test_transform_description(es):
    feature = TransformFeature(IdentityFeature(es["log"].ww["value"]), Absolute)
    description = 'The absolute value of the "value".'
    assert describe_feature(feature) == description


def test_groupby_transform_description(es):
    feature = GroupByTransformFeature(
        IdentityFeature(es["log"].ww["value"]),
        CumMean,
        IdentityFeature(es["log"].ww["session_id"]),
    )
    description = 'The cumulative mean of the "value" for each "session_id".'

    assert describe_feature(feature) == description


def test_aggregation_description(es):
    feature = AggregationFeature(
        IdentityFeature(es["log"].ww["value"]),
        "sessions",
        Mean,
    )
    description = 'The average of the "value" of all instances of "log" for each "id" in "sessions".'
    assert describe_feature(feature) == description

    stacked_agg = AggregationFeature(feature, "customers", Sum)
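    # description[1:-1] drops the leading "T" and the trailing period, so the
    # literal "t" below rejoins with "he ..." to embed the base description.
    stacked_description = (
        'The sum of t{} of all instances of "sessions" for each "id" '
        'in "customers".'.format(description[1:-1])
    )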
    assert describe_feature(stacked_agg) == stacked_description
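

def test_stacked_description_splice_sketch():
    # Illustrative sketch, not part of the original suite: it exercises only the
    # string splice used for stacked descriptions above and never calls
    # describe_feature. The test name and the hard-coded strings are assumptions
    # chosen to mirror test_aggregation_description.
    description = 'The average of the "value" of all instances of "log" for each "id" in "sessions".'
    # description[1:-1] drops the leading "T" and the trailing period; the
    # literal "t" in the template restores a lowercase "the".
    stacked_description = (
        'The sum of t{} of all instances of "sessions" for each "id" '
        'in "customers".'.format(description[1:-1])
    )
    assert stacked_description == (
        'The sum of the average of the "value" of all instances of "log" '
        'for each "id" in "sessions" of all instances of "sessions" for '
        'each "id" in "customers".'
    )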


def test_aggregation_description_where(es):
    where_feature = TransformFeature(
        IdentityFeature(es["log"].ww["countrycode"]),
        EqualScalar("US"),
    )
    feature = AggregationFeature(
        IdentityFeature(es["log"].ww["value"]),
        "sessions",
        Mean,
        where=where_feature,
    )
    description = (
        'The average of the "value" of all instances of "log" where the '
        '"countrycode" is US for each "id" in "sessions".'
    )

    assert describe_feature(feature) == description


def test_aggregation_description_use_previous(es):
    feature = AggregationFeature(
        IdentityFeature(es["log"].ww["value"]),
        "sessions",
        Mean,
        use_previous="5d",
    )
    description = 'The average of the "value" of the previous 5 days of "log" for each "id" in "sessions".'

    assert describe_feature(feature) == description


def test_multioutput_description(es):
    n_most_common = NMostCommon(2)
    n_most_common_feature = AggregationFeature(
        IdentityFeature(es["log"].ww["zipcode"]),
        "sessions",
        n_most_common,
    )
    first_most_common_slice = n_most_common_feature[0]
    second_most_common_slice = n_most_common_feature[1]

    n_most_common_base = 'The 2 most common values of the "zipcode" of all instances of "log" for each "id" in "sessions".'
    n_most_common_first = (
        'The most common value of the "zipcode" of all instances of "log" '
        'for each "id" in "sessions".'
    )
    n_most_common_second = (
        'The 2nd most common value of the "zipcode" of all instances of '
        '"log" for each "id" in "sessions".'
    )

    assert describe_feature(n_most_common_feature) == n_most_common_base
    assert describe_feature(first_most_common_slice) == n_most_common_first
    assert describe_feature(second_most_common_slice) == n_most_common_second

    class CustomMultiOutput(TransformPrimitive):
        name = "custom_multioutput"
        input_types = [ColumnSchema(semantic_tags={"category"})]
        return_type = ColumnSchema(semantic_tags={"category"})

        number_output_features = 4

    custom_feat = TransformFeature(
        IdentityFeature(es["log"].ww["zipcode"]),
        CustomMultiOutput,
    )

    generic_base = 'The result of applying CUSTOM_MULTIOUTPUT to the "zipcode".'
    generic_first = 'The 1st output from applying CUSTOM_MULTIOUTPUT to the "zipcode".'
    generic_second = 'The 2nd output from applying CUSTOM_MULTIOUTPUT to the "zipcode".'

    assert describe_feature(custom_feat) == generic_base
    assert describe_feature(custom_feat[0]) == generic_first
    assert describe_feature(custom_feat[1]) == generic_second

    CustomMultiOutput.description_template = [
        "the multioutput of {}",
        "the {nth_slice} multioutput part of {}",
    ]
    template_base = 'The multioutput of the "zipcode".'
    template_first_slice = 'The 1st multioutput part of the "zipcode".'
    template_second_slice = 'The 2nd multioutput part of the "zipcode".'
    template_third_slice = 'The 3rd multioutput part of the "zipcode".'
    template_fourth_slice = 'The 4th multioutput part of the "zipcode".'
    assert describe_feature(custom_feat) == template_base
    assert describe_feature(custom_feat[0]) == template_first_slice
    assert describe_feature(custom_feat[1]) == template_second_slice
    assert describe_feature(custom_feat[2]) == template_third_slice
    assert describe_feature(custom_feat[3]) == template_fourth_slice

    CustomMultiOutput.description_template = [
        "the multioutput of {}",
        "the primary multioutput part of {}",
        "the secondary multioutput part of {}",
    ]
    custom_base = 'The multioutput of the "zipcode".'
    custom_first_slice = 'The primary multioutput part of the "zipcode".'
    custom_second_slice = 'The secondary multioutput part of the "zipcode".'
    bad_slice_error = "Slice out of range of template"
    assert describe_feature(custom_feat) == custom_base
    assert describe_feature(custom_feat[0]) == custom_first_slice
    assert describe_feature(custom_feat[1]) == custom_second_slice
    with pytest.raises(IndexError, match=bad_slice_error):
        describe_feature(custom_feat[2])


def test_generic_description(es):
    class NoName(TransformPrimitive):
        input_types = [ColumnSchema(semantic_tags={"category"})]
        return_type = ColumnSchema(semantic_tags={"category"})

        def generate_name(self, base_feature_names):
            return "%s(%s%s)" % (
                "NO_NAME",
                ", ".join(base_feature_names),
                self.get_args_string(),
            )

    class CustomAgg(AggregationPrimitive):
        name = "custom_aggregation"
        input_types = [ColumnSchema(semantic_tags={"category"})]
        return_type = ColumnSchema(semantic_tags={"category"})

    class CustomTrans(TransformPrimitive):
        name = "custom_transform"
        input_types = [ColumnSchema(semantic_tags={"category"})]
        return_type = ColumnSchema(semantic_tags={"category"})

    no_name = TransformFeature(IdentityFeature(es["log"].ww["zipcode"]), NoName)
    no_name_description = 'The result of applying NoName to the "zipcode".'
    assert describe_feature(no_name) == no_name_description

    custom_agg = AggregationFeature(
        IdentityFeature(es["log"].ww["zipcode"]),
        "customers",
        CustomAgg,
    )
    custom_agg_description = 'The result of applying CUSTOM_AGGREGATION to the "zipcode" of all instances of "log" for each "id" in "customers".'
    assert describe_feature(custom_agg) == custom_agg_description

    custom_trans = TransformFeature(
        IdentityFeature(es["log"].ww["zipcode"]),
        CustomTrans,
    )
    custom_trans_description = (
        'The result of applying CUSTOM_TRANSFORM to the "zipcode".'
    )
    assert describe_feature(custom_trans) == custom_trans_description


def test_column_description(es):
    column_description = "the name of the device used for each session"
    es["sessions"].ww.columns["device_name"].description = column_description
    identity_feat = IdentityFeature(es["sessions"].ww["device_name"])
    assert (
        describe_feature(identity_feat)
        == column_description[0].upper() + column_description[1:] + "."
    )


def test_metadata(es, tmp_path):
    identity_feature_descriptions = {
        "sessions: device_name": "the name of the device used for each session",
        "customers: id": "the customer's id",
    }
    agg_feat = AggregationFeature(
        IdentityFeature(es["sessions"].ww["device_name"]),
        "customers",
        NumUnique,
    )
    agg_description = (
        "The number of unique elements in the name of the device used for each "
        'session of all instances of "sessions" for each customer\'s id.'
    )
    assert (
        describe_feature(agg_feat, feature_descriptions=identity_feature_descriptions)
        == agg_description
    )

    transform_feat = GroupByTransformFeature(
        IdentityFeature(es["log"].ww["value"]),
        CumMean,
        IdentityFeature(es["log"].ww["session_id"]),
    )
    transform_description = 'The running average of the "value" for each "session_id".'
    primitive_templates = {"cum_mean": "the running average of {}"}
    assert (
        describe_feature(transform_feat, primitive_templates=primitive_templates)
        == transform_description
    )

    custom_agg = AggregationFeature(
        IdentityFeature(es["log"].ww["zipcode"]),
        "sessions",
        Mode,
    )
    auto_description = 'The most frequently occurring value of the "zipcode" of all instances of "log" for each "id" in "sessions".'
    custom_agg_description = "the most frequently used zipcode"
    custom_feature_description = (
        custom_agg_description[0].upper() + custom_agg_description[1:] + "."
    )
    feature_description_dict = {"sessions: MODE(log.zipcode)": custom_agg_description}
    assert describe_feature(custom_agg) == auto_description
    assert (
        describe_feature(custom_agg, feature_descriptions=feature_description_dict)
        == custom_feature_description
    )

    metadata = {
        "feature_descriptions": {
            **identity_feature_descriptions,
            **feature_description_dict,
        },
        "primitive_templates": primitive_templates,
    }
    metadata_path = os.path.join(tmp_path, "description_metadata.json")
    with open(metadata_path, "w") as f:
        json.dump(metadata, f)
    assert describe_feature(agg_feat, metadata_file=metadata_path) == agg_description
    assert (
        describe_feature(transform_feat, metadata_file=metadata_path)
        == transform_description
    )
    assert (
        describe_feature(custom_agg, metadata_file=metadata_path)
        == custom_feature_description
    )


================================================
FILE: featuretools/tests/primitive_tests/test_feature_serialization.py
================================================
import os

import boto3
import pandas as pd
import pytest
from pympler.asizeof import asizeof
from smart_open import open
from woodwork.column_schema import ColumnSchema

from featuretools import (
    AggregationFeature,
    DirectFeature,
    EntitySet,
    Feature,
    GroupByTransformFeature,
    IdentityFeature,
    TransformFeature,
    dfs,
    feature_base,
    load_features,
    primitives,
    save_features,
)
from featuretools.feature_base import FeatureOutputSlice
from featuretools.feature_base.cache import feature_cache
from featuretools.feature_base.features_deserializer import FeaturesDeserializer
from featuretools.feature_base.features_serializer import FeaturesSerializer
from featuretools.primitives import (
    Count,
    CumSum,
    Day,
    DistanceToHoliday,
    Haversine,
    IsIn,
    Max,
    Mean,
    Min,
    Mode,
    Month,
    MultiplyNumericScalar,
    Negate,
    NMostCommon,
    NumberOfCommonWords,
    NumCharacters,
    NumUnique,
    NumWords,
    PercentTrue,
    Skew,
    Std,
    Sum,
    TransformPrimitive,
    Weekday,
    Year,
)
from featuretools.primitives.base import AggregationPrimitive
from featuretools.tests.testing_utils import check_names
from featuretools.version import ENTITYSET_SCHEMA_VERSION, FEATURES_SCHEMA_VERSION

BUCKET_NAME = "test-bucket"
WRITE_KEY_NAME = "test-key"
TEST_S3_URL = "s3://{}/{}".format(BUCKET_NAME, WRITE_KEY_NAME)
TEST_FILE = "test_feature_serialization_feature_schema_{}_entityset_schema_{}_2022_12_28.json".format(
    FEATURES_SCHEMA_VERSION,
    ENTITYSET_SCHEMA_VERSION,
)
S3_URL = "s3://featuretools-static/" + TEST_FILE
URL = "https://featuretools-static.s3.amazonaws.com/" + TEST_FILE
TEST_CONFIG = "CheckConfigPassesOn"
TEST_KEY = "test_access_key_features"


@pytest.fixture(autouse=True)
def reset_dfs_cache():
    feature_cache.enabled = False
    feature_cache.clear_all()


def assert_features(original, deserialized):
    for feat_1, feat_2 in zip(original, deserialized):
        assert feat_1.unique_name() == feat_2.unique_name()
        assert feat_1.entityset == feat_2.entityset


def pickle_features_test_helper(es_size, features_original, dir_path):
    filepath = os.path.join(dir_path, "test_feature")

    save_features(features_original, filepath)
    features_deserializedA = load_features(filepath)
    assert os.path.getsize(filepath) < es_size
    os.remove(filepath)

    with open(filepath, "w") as f:
        save_features(features_original, f)
    # Close the handle before the file is removed below.
    with open(filepath) as f:
        features_deserializedB = load_features(f)
    assert os.path.getsize(filepath) < es_size
    os.remove(filepath)

    features = save_features(features_original)
    features_deserializedC = load_features(features)
    assert asizeof(features) < es_size

    features_deserialized_options = [
        features_deserializedA,
        features_deserializedB,
        features_deserializedC,
    ]
    for features_deserialized in features_deserialized_options:
        assert_features(features_original, features_deserialized)


def test_pickle_features(es, tmp_path):
    features_original = dfs(
        target_dataframe_name="sessions",
        entityset=es,
        features_only=True,
    )
    pickle_features_test_helper(asizeof(es), features_original, str(tmp_path))


def test_pickle_features_with_custom_primitive(es, tmp_path):
    class NewMax(AggregationPrimitive):
        name = "new_max"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

    features_original = dfs(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=["Last", "Mean", NewMax],
        features_only=True,
    )

    assert any(isinstance(feat.primitive, NewMax) for feat in features_original)
    pickle_features_test_helper(asizeof(es), features_original, str(tmp_path))


def test_serialized_renamed_features(es):
    def serialize_name_unchanged(original):
        new_name = "MyFeature"
        original_names = original.get_feature_names()
        renamed = original.rename(new_name)
        new_names = (
            [new_name]
            if len(original_names) == 1
            else [new_name + "[{}]".format(i) for i in range(len(original_names))]
        )
        check_names(renamed, new_name, new_names)

        serializer = FeaturesSerializer([renamed])
        serialized = serializer.to_dict()

        deserializer = FeaturesDeserializer(serialized)
        deserialized = deserializer.to_list()[0]
        check_names(deserialized, new_name, new_names)

    identity_original = IdentityFeature(es["log"].ww["value"])
    assert identity_original.get_name() == "value"

    value = IdentityFeature(es["log"].ww["value"])

    primitive = primitives.Max()
    agg_original = AggregationFeature(value, "customers", primitive)
    assert agg_original.get_name() == "MAX(log.value)"

    direct_original = DirectFeature(
        IdentityFeature(es["customers"].ww["age"]),
        "sessions",
    )
    assert direct_original.get_name() == "customers.age"

    primitive = primitives.MultiplyNumericScalar(value=2)
    transform_original = TransformFeature(value, primitive)
    assert transform_original.get_name() == "value * 2"

    zipcode = IdentityFeature(es["log"].ww["zipcode"])
    primitive = CumSum()
    groupby_original = feature_base.GroupByTransformFeature(value, primitive, zipcode)
    assert groupby_original.get_name() == "CUM_SUM(value) by zipcode"

    multioutput_original = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    assert multioutput_original.get_name() == "N_MOST_COMMON(log.product_id, n=2)"

    featureslice_original = feature_base.FeatureOutputSlice(multioutput_original, 0)
    assert featureslice_original.get_name() == "N_MOST_COMMON(log.product_id, n=2)[0]"

    feature_type_list = [
        identity_original,
        agg_original,
        direct_original,
        transform_original,
        groupby_original,
        multioutput_original,
        featureslice_original,
    ]

    for feature_type in feature_type_list:
        serialize_name_unchanged(feature_type)


@pytest.fixture
def s3_client():
    _environ = os.environ.copy()
    from moto import mock_aws

    with mock_aws():
        s3 = boto3.resource("s3")
        yield s3
    os.environ.clear()
    os.environ.update(_environ)


@pytest.fixture
def s3_bucket(s3_client, region="us-east-2"):
    location = {"LocationConstraint": region}
    s3_client.create_bucket(
        Bucket=BUCKET_NAME,
        ACL="public-read-write",
        CreateBucketConfiguration=location,
    )
    s3_bucket = s3_client.Bucket(BUCKET_NAME)
    yield s3_bucket


def test_serialize_features_mock_s3(es, s3_client, s3_bucket):
    features_original = dfs(
        target_dataframe_name="sessions",
        entityset=es,
        features_only=True,
    )

    save_features(features_original, TEST_S3_URL)

    obj = list(s3_bucket.objects.all())[0].key
    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL="public-read-write")

    features_deserialized = load_features(TEST_S3_URL)
    assert_features(features_original, features_deserialized)


def test_serialize_features_mock_anon_s3(es, s3_client, s3_bucket):
    features_original = dfs(
        target_dataframe_name="sessions",
        entityset=es,
        features_only=True,
    )

    save_features(features_original, TEST_S3_URL, profile_name=False)

    obj = list(s3_bucket.objects.all())[0].key
    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL="public-read-write")

    features_deserialized = load_features(TEST_S3_URL, profile_name=False)
    assert_features(features_original, features_deserialized)


@pytest.mark.parametrize("profile_name", ["test", False])
def test_s3_test_profile(es, s3_client, s3_bucket, setup_test_profile, profile_name):
    features_original = dfs(
        target_dataframe_name="sessions",
        entityset=es,
        features_only=True,
    )

    save_features(features_original, TEST_S3_URL, profile_name="test")

    obj = list(s3_bucket.objects.all())[0].key
    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL="public-read-write")

    features_deserialized = load_features(TEST_S3_URL, profile_name=profile_name)
    assert_features(features_original, features_deserialized)


@pytest.mark.parametrize("url,profile_name", [(S3_URL, False), (URL, None)])
def test_deserialize_features_s3(es, url, profile_name):
    agg_primitives = [
        Sum,
        Std,
        Max,
        Skew,
        Min,
        Mean,
        Count,
        PercentTrue,
        NumUnique,
        Mode,
    ]

    trans_primitives = [Day, Year, Month, Weekday, Haversine, NumWords, NumCharacters]

    features_original = dfs(
        target_dataframe_name="sessions",
        entityset=es,
        features_only=True,
        agg_primitives=agg_primitives,
        trans_primitives=trans_primitives,
    )

    features_deserialized = load_features(url, profile_name=profile_name)
    assert_features(features_original, features_deserialized)


def test_serialize_url(es):
    features_original = dfs(
        target_dataframe_name="sessions",
        entityset=es,
        features_only=True,
    )
    error_text = "Writing to URLs is not supported"
    with pytest.raises(ValueError, match=error_text):
        save_features(features_original, URL)


def test_custom_feature_names_retained_during_serialization(es, tmp_path):
    class MultiCumulative(TransformPrimitive):
        name = "multi_cum_sum"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 3

    multi_output_trans_feat = Feature(
        es["log"].ww["value"],
        primitive=MultiCumulative,
    )
    groupby_trans_feat = GroupByTransformFeature(
        es["log"].ww["value"],
        primitive=MultiCumulative,
        groupby=es["log"].ww["product_id"],
    )
    multi_output_agg_feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=2),
    )
    # Named output_slice to avoid shadowing the built-in slice.
    output_slice = FeatureOutputSlice(multi_output_trans_feat, 1)
    stacked_feat = Feature(output_slice, primitive=Negate)

    trans_names = ["cumulative_sum", "cumulative_max", "cumulative_min"]
    multi_output_trans_feat.set_feature_names(trans_names)
    groupby_trans_names = ["grouped_sum", "grouped_max", "grouped_min"]
    groupby_trans_feat.set_feature_names(groupby_trans_names)
    agg_names = ["first_most_common", "second_most_common"]
    multi_output_agg_feat.set_feature_names(agg_names)

    features = [
        multi_output_trans_feat,
        multi_output_agg_feat,
        groupby_trans_feat,
        stacked_feat,
    ]
    file = os.path.join(tmp_path, "features.json")
    save_features(features, file)
    deserialized_features = load_features(file)

    new_trans, new_agg, new_groupby, new_stacked = deserialized_features
    assert new_trans.get_feature_names() == trans_names
    assert new_agg.get_feature_names() == agg_names
    assert new_groupby.get_feature_names() == groupby_trans_names
    assert new_stacked.get_feature_names() == ["-(cumulative_max)"]


def test_deserializer_uses_common_primitive_instances_no_args(es, tmp_path):
    features = dfs(
        entityset=es,
        target_dataframe_name="products",
        features_only=True,
        agg_primitives=["sum"],
        trans_primitives=["is_null"],
    )

    is_null_features = [f for f in features if f.primitive.name == "is_null"]
    sum_features = [f for f in features if f.primitive.name == "sum"]

    # Make sure we have multiple features of each type
    assert len(is_null_features) > 1
    assert len(sum_features) > 1

    # DFS should use the same primitive instance for all features that share a primitive
    is_null_primitive = is_null_features[0].primitive
    sum_primitive = sum_features[0].primitive
    assert all([f.primitive is is_null_primitive for f in is_null_features])
    assert all([f.primitive is sum_primitive for f in sum_features])

    file = os.path.join(tmp_path, "features.json")
    save_features(features, file)
    deserialized_features = load_features(file)
    new_is_null_features = [
        f for f in deserialized_features if f.primitive.name == "is_null"
    ]
    new_sum_features = [f for f in deserialized_features if f.primitive.name == "sum"]

    # After deserialization all features that share a primitive should use the same primitive instance
    new_is_null_primitive = new_is_null_features[0].primitive
    new_sum_primitive = new_sum_features[0].primitive
    assert all([f.primitive is new_is_null_primitive for f in new_is_null_features])
    assert all([f.primitive is new_sum_primitive for f in new_sum_features])


def test_deserializer_uses_common_primitive_instances_with_args(es, tmp_path):
    # Single argument
    scalar1 = MultiplyNumericScalar(value=1)
    scalar5 = MultiplyNumericScalar(value=5)
    features = dfs(
        entityset=es,
        target_dataframe_name="products",
        features_only=True,
        agg_primitives=["sum"],
        trans_primitives=[scalar1, scalar5],
    )

    scalar1_features = [
        f
        for f in features
        if f.primitive.name == "multiply_numeric_scalar" and " * 1" in f.get_name()
    ]
    scalar5_features = [
        f
        for f in features
        if f.primitive.name == "multiply_numeric_scalar" and " * 5" in f.get_name()
    ]

    # Make sure we have multiple features of each type
    assert len(scalar1_features) > 1
    assert len(scalar5_features) > 1

    # DFS should use the passed-in primitive instance for all features
    assert all([f.primitive is scalar1 for f in scalar1_features])
    assert all([f.primitive is scalar5 for f in scalar5_features])

    file = os.path.join(tmp_path, "features.json")
    save_features(features, file)
    deserialized_features = load_features(file)

    new_scalar1_features = [
        f
        for f in deserialized_features
        if f.primitive.name == "multiply_numeric_scalar" and " * 1" in f.get_name()
    ]
    new_scalar5_features = [
        f
        for f in deserialized_features
        if f.primitive.name == "multiply_numeric_scalar" and " * 5" in f.get_name()
    ]

    # After deserialization all features that share a primitive should use the same primitive instance
    new_scalar1_primitive = new_scalar1_features[0].primitive
    new_scalar5_primitive = new_scalar5_features[0].primitive
    assert all([f.primitive is new_scalar1_primitive for f in new_scalar1_features])
    assert all([f.primitive is new_scalar5_primitive for f in new_scalar5_features])
    assert new_scalar1_primitive.value == 1
    assert new_scalar5_primitive.value == 5

    # Test primitive with multiple args
    distance_to_holiday = DistanceToHoliday(
        holiday="Canada Day",
        country="Canada",
    )
    features = dfs(
        entityset=es,
        target_dataframe_name="customers",
        features_only=True,
        agg_primitives=[],
        trans_primitives=[distance_to_holiday],
    )

    distance_features = [
        f for f in features if f.primitive.name == "distance_to_holiday"
    ]

    assert len(distance_features) > 1

    # DFS should use the passed-in primitive instance for all features
    assert all([f.primitive is distance_to_holiday for f in distance_features])

    file = os.path.join(tmp_path, "distance_features.json")
    save_features(distance_features, file)
    new_distance_features = load_features(file)

    # After deserialization all features that share a primitive should use the same primitive instance
    new_distance_primitive = new_distance_features[0].primitive
    assert all(
        [f.primitive is new_distance_primitive for f in new_distance_features],
    )
    assert new_distance_primitive.holiday == "Canada Day"
    assert new_distance_primitive.country == "Canada"

    # Test primitive with list arg
    is_in = IsIn(list_of_outputs=[5, True, "coke zero"])
    features = dfs(
        entityset=es,
        target_dataframe_name="customers",
        features_only=True,
        agg_primitives=[],
        trans_primitives=[is_in],
    )

    is_in_features = [f for f in features if f.primitive.name == "isin"]
    assert len(is_in_features) > 1

    # DFS should use the passed-in primitive instance for all features
    assert all([f.primitive is is_in for f in is_in_features])

    file = os.path.join(tmp_path, "is_in_features.json")
    save_features(is_in_features, file)
    new_is_in_features = load_features(file)

    # After deserialization all features that share a primitive should use the same primitive instance
    new_is_in_primitive = new_is_in_features[0].primitive
    assert all([f.primitive is new_is_in_primitive for f in new_is_in_features])
    assert new_is_in_primitive.list_of_outputs == [5, True, "coke zero"]


def test_can_serialize_word_set_for_number_of_common_words_feature(es):
    # The word_set argument is passed in as a set, which is not JSON-serializable.
    # This test checks internal logic that converts the set to a list so it can be serialized
    common_word_set = {"hello", "my"}
    df = pd.DataFrame({"text": ["hello my name is hi"]})
    es = EntitySet()
    es.add_dataframe(dataframe_name="df", index="idx", dataframe=df, make_index=True)

    num_common_words = NumberOfCommonWords(word_set=common_word_set)
    fm, fd = dfs(
        entityset=es,
        target_dataframe_name="df",
        trans_primitives=[num_common_words],
    )

    feat = fd[-1]
    save_features([feat])


================================================
FILE: featuretools/tests/primitive_tests/test_feature_utils.py
================================================
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double, Integer

from featuretools.feature_base.utils import is_valid_input


def test_is_valid_input():
    assert is_valid_input(candidate=ColumnSchema(), template=ColumnSchema())

    assert is_valid_input(
        candidate=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
        template=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
    )

    assert is_valid_input(
        candidate=ColumnSchema(
            logical_type=Integer,
            semantic_tags={"index", "numeric"},
        ),
        template=ColumnSchema(semantic_tags={"index"}),
    )

    assert is_valid_input(
        candidate=ColumnSchema(semantic_tags={"index"}),
        template=ColumnSchema(semantic_tags={"index"}),
    )

    assert is_valid_input(
        candidate=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
        template=ColumnSchema(),
    )

    assert is_valid_input(
        candidate=ColumnSchema(logical_type=Integer),
        template=ColumnSchema(logical_type=Integer),
    )

    assert is_valid_input(
        candidate=ColumnSchema(logical_type=Integer, semantic_tags={"numeric"}),
        template=ColumnSchema(logical_type=Integer),
    )

    assert not is_valid_input(
        candidate=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
        template=ColumnSchema(logical_type=Double, semantic_tags={"index"}),
    )

    assert not is_valid_input(
        candidate=ColumnSchema(logical_type=Integer, semantic_tags=set()),
        template=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
    )

    assert not is_valid_input(
        candidate=ColumnSchema(),
        template=ColumnSchema(logical_type=Integer, semantic_tags={"index"}),
    )

    assert not is_valid_input(
        candidate=ColumnSchema(),
        template=ColumnSchema(logical_type=Integer),
    )

    assert not is_valid_input(
        candidate=ColumnSchema(),
        template=ColumnSchema(semantic_tags={"index"}),
    )


================================================
FILE: featuretools/tests/primitive_tests/test_feature_visualizer.py
================================================
import json
import os
import re

import graphviz
import pytest

from featuretools.feature_base import (
    AggregationFeature,
    DirectFeature,
    FeatureOutputSlice,
    GroupByTransformFeature,
    IdentityFeature,
    TransformFeature,
    graph_feature,
)
from featuretools.primitives import Count, CumMax, Mode, NMostCommon, Year


@pytest.fixture
def simple_feat(es):
    return IdentityFeature(es["log"].ww["id"])


@pytest.fixture
def trans_feat(es):
    return TransformFeature(IdentityFeature(es["customers"].ww["cancel_date"]), Year)


def test_returns_digraph_object(simple_feat):
    graph = graph_feature(simple_feat)
    assert isinstance(graph, graphviz.Digraph)


def test_saving_png_file(simple_feat, tmp_path):
    output_path = str(tmp_path.joinpath("test1.png"))
    graph_feature(simple_feat, to_file=output_path)
    assert os.path.isfile(output_path)


def test_missing_file_extension(simple_feat):
    output_path = "test1"
    with pytest.raises(ValueError, match="Please use a file extension"):
        graph_feature(simple_feat, to_file=output_path)


def test_invalid_format(simple_feat):
    output_path = "test1.xyz"
    with pytest.raises(ValueError, match="Unknown format"):
        graph_feature(simple_feat, to_file=output_path)


def test_transform(es, trans_feat):
    feat = trans_feat
    graph = graph_feature(feat).source

    feat_name = feat.get_name()
    prim_node = "0_{}_year".format(feat_name)
    dataframe_table = "\u2605 customers (target)"
    prim_edge = 'customers:cancel_date -> "{}"'.format(prim_node)
    feat_edge = '"{}" -> customers:"{}"'.format(prim_node, feat_name)

    graph_components = [feat_name, dataframe_table, prim_node, prim_edge, feat_edge]
    for component in graph_components:
        assert component in graph

    matches = re.findall(r"customers \[label=<\n<TABLE.*?</TABLE>>", graph, re.DOTALL)
    assert len(matches) == 1
    rows = re.findall(r"<TR.*?</TR>", matches[0], re.DOTALL)
    assert len(rows) == 3
    to_match = ["customers", "cancel_date", feat_name]
    for match, row in zip(to_match, rows):
        assert match in row


def test_html_symbols(es, tmp_path):
    output_path_template = str(tmp_path.joinpath("test{}.png"))
    value = IdentityFeature(es["log"].ww["value"])
    gt = value > 5
    lt = value < 5
    ge = value >= 5
    le = value <= 5

    for i, feat in enumerate([gt, lt, ge, le]):
        output_path = output_path_template.format(i)
        graph = graph_feature(feat, to_file=output_path).source
        assert os.path.isfile(output_path)
        assert feat.get_name() in graph


def test_groupby_transform(es):
    feat = GroupByTransformFeature(
        IdentityFeature(es["customers"].ww["age"]),
        CumMax,
        IdentityFeature(es["customers"].ww["cohort"]),
    )
    graph = graph_feature(feat).source

    feat_name = feat.get_name()
    prim_node = "0_{}_cum_max".format(feat_name)
    groupby_node = "{}_groupby_customers--cohort".format(feat_name)
    dataframe_table = "\u2605 customers (target)"

    groupby_edge = 'customers:cohort -> "{}"'.format(groupby_node)
    groupby_input = 'customers:age -> "{}"'.format(groupby_node)
    prim_input = '"{}" -> "{}"'.format(groupby_node, prim_node)
    feat_edge = '"{}" -> customers:"{}"'.format(prim_node, feat_name)

    graph_components = [
        feat_name,
        prim_node,
        groupby_node,
        dataframe_table,
        groupby_edge,
        groupby_input,
        prim_input,
        feat_edge,
    ]
    for component in graph_components:
        assert component in graph

    matches = re.findall(r"customers \[label=<\n<TABLE.*?</TABLE>>", graph, re.DOTALL)
    assert len(matches) == 1
    rows = re.findall(r"<TR.*?</TR>", matches[0], re.DOTALL)
    assert len(rows) == 4
    assert dataframe_table in rows[0]
    assert feat_name in rows[-1]
    assert ("age" in rows[1] and "cohort" in rows[2]) or (
        "age" in rows[2] and "cohort" in rows[1]
    )


def test_groupby_transform_direct_groupby(es):
    groupby = DirectFeature(
        IdentityFeature(es["cohorts"].ww["cohort_name"]),
        "customers",
    )
    feat = GroupByTransformFeature(
        IdentityFeature(es["customers"].ww["age"]),
        CumMax,
        groupby,
    )
    graph = graph_feature(feat).source

    groupby_name = groupby.get_name()
    feat_name = feat.get_name()
    join_node = "1_{}_join".format(groupby_name)
    prim_node = "0_{}_cum_max".format(feat_name)
    groupby_node = "{}_groupby_customers--{}".format(feat_name, groupby_name)
    customers_table = "\u2605 customers (target)"
    cohorts_table = "cohorts"

    join_groupby = '"{}" -> customers:cohort'.format(join_node)
    join_input = 'cohorts:cohort_name -> "{}"'.format(join_node)
    join_out_edge = '"{}" -> customers:"{}"'.format(join_node, groupby_name)
    groupby_edge = 'customers:"{}" -> "{}"'.format(groupby_name, groupby_node)
    groupby_input = 'customers:age -> "{}"'.format(groupby_node)
    prim_input = '"{}" -> "{}"'.format(groupby_node, prim_node)
    feat_edge = '"{}" -> customers:"{}"'.format(prim_node, feat_name)

    graph_components = [
        groupby_name,
        feat_name,
        join_node,
        prim_node,
        groupby_node,
        customers_table,
        cohorts_table,
        join_groupby,
        join_input,
        join_out_edge,
        groupby_edge,
        groupby_input,
        prim_input,
        feat_edge,
    ]
    for component in graph_components:
        assert component in graph

    dataframes = {
        "cohorts": [cohorts_table, "cohort_name"],
        "customers": [customers_table, "cohort", "age", groupby_name, feat_name],
    }
    for dataframe in dataframes:
        regex = r"{} \[label=<\n<TABLE.*?</TABLE>>".format(dataframe)
        matches = re.findall(regex, graph, re.DOTALL)
        assert len(matches) == 1

        rows = re.findall(r"<TR.*?</TR>", matches[0], re.DOTALL)
        assert len(rows) == len(dataframes[dataframe])

        for row in rows:
            matched = False
            for i in dataframes[dataframe]:
                if i in row:
                    matched = True
                    dataframes[dataframe].remove(i)
                    break
            assert matched
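

# Illustrative sketch, not part of featuretools: the table-row matching loop used by
# the graph tests in this file could be factored into a helper along these lines.
# The name `_rows_match_labels` and its signature are hypothetical. It assumes the
# Graphviz source renders each dataframe as `<name> [label=<\n<TABLE>...</TABLE>>`,
# which is the pattern the surrounding tests already search for.
def _rows_match_labels(graph_source, table_name, expected_labels):
    """Return True if the table for `table_name` has one row per expected label."""
    regex = r"{} \[label=<\n<TABLE.*?</TABLE>>".format(re.escape(table_name))
    tables = re.findall(regex, graph_source, re.DOTALL)
    if len(tables) != 1:
        return False

    rows = re.findall(r"<TR.*?</TR>", tables[0], re.DOTALL)
    remaining = list(expected_labels)  # copy so the caller's list is not mutated
    if len(rows) != len(remaining):
        return False

    for row in rows:
        # Each row must consume exactly one of the labels not yet matched
        label = next((item for item in remaining if item in row), None)
        if label is None:
            return False
        remaining.remove(label)
    return True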


def test_aggregation(es):
    feat = AggregationFeature(IdentityFeature(es["log"].ww["id"]), "sessions", Count)
    graph = graph_feature(feat).source

    feat_name = feat.get_name()
    prim_node = "0_{}_count".format(feat_name)
    groupby_node = "{}_groupby_log--session_id".format(feat_name)

    sessions_table = "\u2605 sessions (target)"
    log_table = "log"
    groupby_edge = 'log:session_id -> "{}"'.format(groupby_node)
    groupby_input = 'log:id -> "{}"'.format(groupby_node)
    prim_input = '"{}" -> "{}"'.format(groupby_node, prim_node)
    feat_edge = '"{}" -> sessions:"{}"'.format(prim_node, feat_name)

    graph_components = [
        feat_name,
        prim_node,
        groupby_node,
        sessions_table,
        log_table,
        groupby_edge,
        groupby_input,
        prim_input,
        feat_edge,
    ]

    for component in graph_components:
        assert component in graph

    dataframes = {
        "log": [log_table, "id", "session_id"],
        "sessions": [sessions_table, feat_name],
    }
    for dataframe in dataframes:
        regex = r"{} \[label=<\n<TABLE.*?</TABLE>>".format(dataframe)
        matches = re.findall(regex, graph, re.DOTALL)
        assert len(matches) == 1

        rows = re.findall(r"<TR.*?</TR>", matches[0], re.DOTALL)
        assert len(rows) == len(dataframes[dataframe])
        for row in rows:
            matched = False
            for i in dataframes[dataframe]:
                if i in row:
                    matched = True
                    dataframes[dataframe].remove(i)
                    break
            assert matched


def test_multioutput(es):
    multioutput = AggregationFeature(
        IdentityFeature(es["log"].ww["zipcode"]),
        "sessions",
        NMostCommon,
    )
    feat = FeatureOutputSlice(multioutput, 0)
    graph = graph_feature(feat).source

    feat_name = feat.get_name()
    prim_node = "0_{}_n_most_common".format(multioutput.get_name())
    groupby_node = "{}_groupby_log--session_id".format(multioutput.get_name())

    sessions_table = "\u2605 sessions (target)"
    log_table = "log"
    groupby_edge = 'log:session_id -> "{}"'.format(groupby_node)
    groupby_input = 'log:zipcode -> "{}"'.format(groupby_node)
    prim_input = '"{}" -> "{}"'.format(groupby_node, prim_node)
    feat_edge = '"{}" -> sessions:"{}"'.format(prim_node, feat_name)

    graph_components = [
        feat_name,
        prim_node,
        groupby_node,
        sessions_table,
        log_table,
        groupby_edge,
        groupby_input,
        prim_input,
        feat_edge,
    ]

    for component in graph_components:
        assert component in graph

    dataframes = {
        "log": [log_table, "zipcode", "session_id"],
        "sessions": [sessions_table, feat_name],
    }
    for dataframe in dataframes:
        regex = r"{} \[label=<\n<TABLE.*?</TABLE>>".format(dataframe)
        matches = re.findall(regex, graph, re.DOTALL)
        assert len(matches) == 1

        rows = re.findall(r"<TR.*?</TR>", matches[0], re.DOTALL)
        assert len(rows) == len(dataframes[dataframe])
        for row in rows:
            matched = False
            for i in dataframes[dataframe]:
                if i in row:
                    matched = True
                    dataframes[dataframe].remove(i)
                    break
            assert matched


def test_direct(es):
    d1 = DirectFeature(
        IdentityFeature(es["customers"].ww["engagement_level"]),
        "sessions",
    )
    d2 = DirectFeature(d1, "log")
    graph = graph_feature(d2).source

    d1_name = d1.get_name()
    d2_name = d2.get_name()
    prim_node1 = "1_{}_join".format(d1_name)
    prim_node2 = "0_{}_join".format(d2_name)

    log_table = "\u2605 log (target)"
    sessions_table = "sessions"
    customers_table = "customers"
    groupby_edge1 = '"{}" -> sessions:customer_id'.format(prim_node1)
    groupby_edge2 = '"{}" -> log:session_id'.format(prim_node2)
    groupby_input1 = 'customers:engagement_level -> "{}"'.format(prim_node1)
    groupby_input2 = 'sessions:"{}" -> "{}"'.format(d1_name, prim_node2)
    d1_edge = '"{}" -> sessions:"{}"'.format(prim_node1, d1_name)
    d2_edge = '"{}" -> log:"{}"'.format(prim_node2, d2_name)

    graph_components = [
        d1_name,
        d2_name,
        prim_node1,
        prim_node2,
        log_table,
        sessions_table,
        customers_table,
        groupby_edge1,
        groupby_edge2,
        groupby_input1,
        groupby_input2,
        d1_edge,
        d2_edge,
    ]
    for component in graph_components:
        assert component in graph

    dataframes = {
        "customers": [customers_table, "engagement_level"],
        "sessions": [sessions_table, "customer_id", d1_name],
        "log": [log_table, "session_id", d2_name],
    }

    for dataframe in dataframes:
        regex = r"{} \[label=<\n<TABLE.*?</TABLE>>".format(dataframe)
        matches = re.findall(regex, graph, re.DOTALL)
        assert len(matches) == 1

        rows = re.findall(r"<TR.*?</TR>", matches[0], re.DOTALL)
        assert len(rows) == len(dataframes[dataframe])
        for row in rows:
            matched = False
            for i in dataframes[dataframe]:
                if i in row:
                    matched = True
                    dataframes[dataframe].remove(i)
                    break
            assert matched


def test_stacked(es, trans_feat):
    stacked = AggregationFeature(trans_feat, "cohorts", Mode)
    graph = graph_feature(stacked).source

    feat_name = stacked.get_name()
    intermediate_name = trans_feat.get_name()
    agg_primitive = "0_{}_mode".format(feat_name)
    trans_primitive = "1_{}_year".format(intermediate_name)
    groupby_node = "{}_groupby_customers--cohort".format(feat_name)

    trans_prim_edge = 'customers:cancel_date -> "{}"'.format(trans_primitive)
    intermediate_edge = '"{}" -> customers:"{}"'.format(
        trans_primitive,
        intermediate_name,
    )
    groupby_edge = 'customers:cohort -> "{}"'.format(groupby_node)
    groupby_input = 'customers:"{}" -> "{}"'.format(intermediate_name, groupby_node)
    agg_input = '"{}" -> "{}"'.format(groupby_node, agg_primitive)
    feat_edge = '"{}" -> cohorts:"{}"'.format(agg_primitive, feat_name)

    graph_components = [
        feat_name,
        intermediate_name,
        agg_primitive,
        trans_primitive,
        groupby_node,
        trans_prim_edge,
        intermediate_edge,
        groupby_edge,
        groupby_input,
        agg_input,
        feat_edge,
    ]
    for component in graph_components:
        assert component in graph

    # re.escape also covers "." and any other regex metacharacters in the feature name
    agg_node = re.findall('"{}" \\[label.*'.format(re.escape(agg_primitive)), graph)
    assert len(agg_node) == 1
    assert "Step 2" in agg_node[0]

    # re.escape also covers "." and any other regex metacharacters in the feature name
    trans_node = re.findall('"{}" \\[label.*'.format(re.escape(trans_primitive)), graph)
    assert len(trans_node) == 1
    assert "Step 1" in trans_node[0]


def test_description_auto_caption(trans_feat):
    default_graph = graph_feature(trans_feat, description=True).source
    default_label = 'label="The year of the \\"cancel_date\\"."'
    assert default_label in default_graph


def test_description_auto_caption_metadata(trans_feat, tmp_path):
    feature_descriptions = {"customers: cancel_date": "the date the customer cancelled"}
    primitive_templates = {"year": "the year that {} occurred"}
    metadata_graph = graph_feature(
        trans_feat,
        description=True,
        feature_descriptions=feature_descriptions,
        primitive_templates=primitive_templates,
    ).source

    metadata_label = 'label="The year that the date the customer cancelled occurred."'
    assert metadata_label in metadata_graph

    metadata = {
        "feature_descriptions": feature_descriptions,
        "primitive_templates": primitive_templates,
    }
    metadata_path = os.path.join(tmp_path, "description_metadata.json")
    with open(metadata_path, "w") as f:
        json.dump(metadata, f)
    json_metadata_graph = graph_feature(
        trans_feat,
        description=True,
        metadata_file=metadata_path,
    ).source
    assert metadata_label in json_metadata_graph


def test_description_custom_caption(trans_feat):
    custom_description = "A custom feature description"
    custom_description_graph = graph_feature(
        trans_feat,
        description=custom_description,
    ).source
    custom_description_label = 'label="A custom feature description"'
    assert custom_description_label in custom_description_graph


================================================
FILE: featuretools/tests/primitive_tests/test_features_deserializer.py
================================================
import logging
from unittest.mock import patch

import pandas as pd
import pytest

from featuretools import (
    AggregationFeature,
    Feature,
    IdentityFeature,
    TransformFeature,
    __version__,
)
from featuretools.feature_base.features_deserializer import FeaturesDeserializer
from featuretools.primitives import (
    Count,
    Max,
    MultiplyNumericScalar,
    NMostCommon,
    NumberOfCommonWords,
    NumUnique,
)
from featuretools.primitives.utils import serialize_primitive
from featuretools.utils.schema_utils import FEATURES_SCHEMA_VERSION


def test_single_feature(es):
    feature = IdentityFeature(es["log"].ww["value"])
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [feature.unique_name()],
        "feature_definitions": {feature.unique_name(): feature.to_dictionary()},
        "primitive_definitions": {},
    }
    deserializer = FeaturesDeserializer(dictionary)

    expected = [feature]
    assert expected == deserializer.to_list()


def test_multioutput_feature(es):
    value = IdentityFeature(es["log"].ww["product_id"])
    threecommon = NMostCommon()
    num_unique = NumUnique()
    tc = Feature(value, parent_dataframe_name="sessions", primitive=threecommon)

    features = [tc, value]
    for i in range(3):
        features.append(
            Feature(
                tc[i],
                parent_dataframe_name="customers",
                primitive=num_unique,
            ),
        )
        features.append(tc[i])

    flist = [feat.unique_name() for feat in features]
    fd = [feat.to_dictionary() for feat in features]
    fdict = dict(zip(flist, fd))

    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": flist,
        "feature_definitions": fdict,
    }
    dictionary["primitive_definitions"] = {
        "0": serialize_primitive(threecommon),
        "1": serialize_primitive(num_unique),
    }

    dictionary["feature_definitions"][flist[0]]["arguments"]["primitive"] = "0"
    dictionary["feature_definitions"][flist[2]]["arguments"]["primitive"] = "1"
    dictionary["feature_definitions"][flist[4]]["arguments"]["primitive"] = "1"
    dictionary["feature_definitions"][flist[6]]["arguments"]["primitive"] = "1"
    deserialized = FeaturesDeserializer(dictionary).to_list()

    for i in range(len(features)):
        assert features[i].unique_name() == deserialized[i].unique_name()


def test_base_features_in_list(es):
    max_primitive = Max()
    value = IdentityFeature(es["log"].ww["value"])
    max_feat = AggregationFeature(value, "sessions", max_primitive)
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [max_feat.unique_name(), value.unique_name()],
        "feature_definitions": {
            max_feat.unique_name(): max_feat.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    dictionary["primitive_definitions"] = {"0": serialize_primitive(max_primitive)}
    dictionary["feature_definitions"][max_feat.unique_name()]["arguments"][
        "primitive"
    ] = "0"
    deserializer = FeaturesDeserializer(dictionary)

    expected = [max_feat, value]
    assert expected == deserializer.to_list()


def test_base_features_not_in_list(es):
    max_primitive = Max()
    mult_primitive = MultiplyNumericScalar(value=2)
    value = IdentityFeature(es["log"].ww["value"])
    value_x2 = TransformFeature(value, mult_primitive)
    max_feat = AggregationFeature(value_x2, "sessions", max_primitive)
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [max_feat.unique_name()],
        "feature_definitions": {
            max_feat.unique_name(): max_feat.to_dictionary(),
            value_x2.unique_name(): value_x2.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    dictionary["primitive_definitions"] = {
        "0": serialize_primitive(max_primitive),
        "1": serialize_primitive(mult_primitive),
    }
    dictionary["feature_definitions"][max_feat.unique_name()]["arguments"][
        "primitive"
    ] = "0"
    dictionary["feature_definitions"][value_x2.unique_name()]["arguments"][
        "primitive"
    ] = "1"
    deserializer = FeaturesDeserializer(dictionary)

    expected = [max_feat]
    assert expected == deserializer.to_list()


@patch("featuretools.utils.schema_utils.FEATURES_SCHEMA_VERSION", "1.1.1")
@pytest.mark.parametrize(
    "hardcoded_schema_version, warns",
    [("2.1.1", True), ("1.2.1", True), ("1.1.2", True), ("1.0.2", False)],
)
def test_later_schema_version(es, caplog, hardcoded_schema_version, warns):
    def test_version(version, warns):
        if warns:
            warning_text = (
                "The schema version of the saved features"
                "(%s) is greater than the latest supported (%s). "
                "You may need to upgrade featuretools. Attempting to load features ..."
                % (version, "1.1.1")
            )
        else:
            warning_text = None

        _check_schema_version(version, es, warning_text, caplog, "warn")

    test_version(hardcoded_schema_version, warns)


@patch("featuretools.utils.schema_utils.FEATURES_SCHEMA_VERSION", "1.1.1")
@pytest.mark.parametrize(
    "hardcoded_schema_version, warns",
    [("0.1.1", True), ("1.0.1", False), ("1.1.0", False)],
)
def test_earlier_schema_version(es, caplog, hardcoded_schema_version, warns):
    def test_version(version, warns):
        if warns:
            warning_text = (
                "The schema version of the saved features"
                "(%s) is no longer supported by this version "
                "of featuretools. Attempting to load features ..." % version
            )
        else:
            warning_text = None

        _check_schema_version(version, es, warning_text, caplog, "log")

    test_version(hardcoded_schema_version, warns)
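

# Illustrative sketch only: the parametrized cases in the two schema-version tests
# above are consistent with the rule below. `_sketch_schema_warning` is a
# hypothetical name, and the rule is inferred from those cases; the actual check
# inside featuretools may be implemented differently.
def _sketch_schema_warning(saved_version, current_version):
    """Return "later", "earlier", or None for a saved schema version string."""
    saved = tuple(int(part) for part in saved_version.split("."))
    current = tuple(int(part) for part in current_version.split("."))
    if saved > current:
        # Any newer major, minor, or patch version triggers the "later" warning
        return "later"
    if saved[0] < current[0]:
        # Only an older major version is reported as no longer supported
        return "earlier"
    return None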


def test_unknown_feature_type(es):
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": ["feature_1"],
        "feature_definitions": {
            "feature_1": {"type": "FakeFeature", "dependencies": [], "arguments": {}},
        },
        "primitive_definitions": {},
    }

    deserializer = FeaturesDeserializer(dictionary)

    with pytest.raises(RuntimeError, match='Unrecognized feature type "FakeFeature"'):
        deserializer.to_list()


def test_unknown_primitive_type(es):
    value = IdentityFeature(es["log"].ww["value"])
    max_feat = AggregationFeature(value, "sessions", Max)
    primitive_dict = serialize_primitive(Max())
    primitive_dict["type"] = "FakePrimitive"
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [max_feat.unique_name(), value.unique_name()],
        "feature_definitions": {
            max_feat.unique_name(): max_feat.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
        "primitive_definitions": {"0": primitive_dict},
    }

    with pytest.raises(RuntimeError) as excinfo:
        FeaturesDeserializer(dictionary)

    error_text = 'Primitive "FakePrimitive" in module "%s" not found' % Max.__module__
    assert error_text == str(excinfo.value)


def test_unknown_primitive_module(es):
    value = IdentityFeature(es["log"].ww["value"])
    max_feat = AggregationFeature(value, "sessions", Max)
    primitive_dict = serialize_primitive(Max())
    primitive_dict["module"] = "fake.module"
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [max_feat.unique_name(), value.unique_name()],
        "feature_definitions": {
            max_feat.unique_name(): max_feat.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
        "primitive_definitions": {"0": primitive_dict},
    }

    with pytest.raises(RuntimeError) as excinfo:
        FeaturesDeserializer(dictionary)

    error_text = 'Primitive "Max" in module "fake.module" not found'
    assert error_text == str(excinfo.value)


def test_feature_use_previous_pd_timedelta(es):
    value = IdentityFeature(es["log"].ww["id"])
    td = pd.Timedelta(12, "W")
    count_primitive = Count()
    count_feature = AggregationFeature(
        value,
        "customers",
        count_primitive,
        use_previous=td,
    )
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [count_feature.unique_name(), value.unique_name()],
        "feature_definitions": {
            count_feature.unique_name(): count_feature.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    dictionary["primitive_definitions"] = {"0": serialize_primitive(count_primitive)}
    dictionary["feature_definitions"][count_feature.unique_name()]["arguments"][
        "primitive"
    ] = "0"
    deserializer = FeaturesDeserializer(dictionary)

    expected = [count_feature, value]
    assert expected == deserializer.to_list()


def test_feature_use_previous_pd_dateoffset(es):
    value = IdentityFeature(es["log"].ww["id"])
    do = pd.DateOffset(months=3)
    count_primitive = Count()
    count_feature = AggregationFeature(
        value,
        "customers",
        count_primitive,
        use_previous=do,
    )
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [count_feature.unique_name(), value.unique_name()],
        "feature_definitions": {
            count_feature.unique_name(): count_feature.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    dictionary["primitive_definitions"] = {"0": serialize_primitive(count_primitive)}
    dictionary["feature_definitions"][count_feature.unique_name()]["arguments"][
        "primitive"
    ] = "0"
    deserializer = FeaturesDeserializer(dictionary)

    expected = [count_feature, value]
    assert expected == deserializer.to_list()

    value = IdentityFeature(es["log"].ww["id"])
    do = pd.DateOffset(months=3, days=2, minutes=30)
    count_feature = AggregationFeature(
        value,
        "customers",
        count_primitive,
        use_previous=do,
    )
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [count_feature.unique_name(), value.unique_name()],
        "feature_definitions": {
            count_feature.unique_name(): count_feature.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    dictionary["primitive_definitions"] = {"0": serialize_primitive(count_primitive)}
    dictionary["feature_definitions"][count_feature.unique_name()]["arguments"][
        "primitive"
    ] = "0"
    deserializer = FeaturesDeserializer(dictionary)

    expected = [count_feature, value]
    assert expected == deserializer.to_list()


def test_word_set_in_number_of_common_words_is_deserialized_back_into_a_set(es):
    id_feat = IdentityFeature(es["log"].ww["comments"])
    number_of_common_words = NumberOfCommonWords(word_set={"hello", "my"})
    transform_feat = TransformFeature(id_feat, number_of_common_words)
    dictionary = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [id_feat.unique_name(), transform_feat.unique_name()],
        "feature_definitions": {
            id_feat.unique_name(): id_feat.to_dictionary(),
            transform_feat.unique_name(): transform_feat.to_dictionary(),
        },
        "primitive_definitions": {"0": serialize_primitive(number_of_common_words)},
    }
    dictionary["feature_definitions"][transform_feat.unique_name()]["arguments"][
        "primitive"
    ] = "0"
    deserializer = FeaturesDeserializer(dictionary)
    assert isinstance(
        deserializer.features_dict["primitive_definitions"]["0"]["arguments"][
            "word_set"
        ],
        set,
    )


def _check_schema_version(version, es, warning_text, caplog, warning_type=None):
    """Deserialize a minimal features dictionary that carries ``version`` as
    its schema version and check the resulting warning.

    With ``warning_type="warn"`` a ``UserWarning`` matching ``warning_text`` is
    expected (nothing is checked if ``warning_text`` is empty). With
    ``warning_type="log"`` the captured ``featuretools`` log output must
    contain ``warning_text``, or be empty when ``warning_text`` is ``None``.
    """
    dictionary = {
        "ft_version": __version__,
        "schema_version": version,
        "entityset": es.to_dictionary(),
        "feature_list": [],
        "feature_definitions": {},
        "primitive_definitions": {},
    }

    if warning_type == "warn" and warning_text:
        with pytest.warns(UserWarning) as record:
            FeaturesDeserializer(dictionary)
        assert record[0].message.args[0] == warning_text
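    elif warning_type == "log":
        logger = logging.getLogger("featuretools")
        # Let records reach the root logger so that caplog can capture them.
        logger.propagate = True
        try:
            FeaturesDeserializer(dictionary)
            if warning_text:
                assert warning_text in caplog.text
            else:
                assert not caplog.text
        finally:
            # Turn propagation back off even if an assertion above fails.
            logger.propagate = False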


================================================
FILE: featuretools/tests/primitive_tests/test_features_serializer.py
================================================
import pandas as pd

from featuretools import (
    AggregationFeature,
    Feature,
    IdentityFeature,
    TransformFeature,
    __version__,
)
from featuretools.entityset.deserialize import description_to_entityset
from featuretools.feature_base.features_serializer import FeaturesSerializer
from featuretools.primitives import (
    Count,
    Max,
    MultiplyNumericScalar,
    NMostCommon,
    NumUnique,
)
from featuretools.primitives.utils import serialize_primitive
from featuretools.version import FEATURES_SCHEMA_VERSION


def test_single_feature(es):
    feature = IdentityFeature(es["log"].ww["value"])
    serializer = FeaturesSerializer([feature])

    expected = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [feature.unique_name()],
        "feature_definitions": {feature.unique_name(): feature.to_dictionary()},
        "primitive_definitions": {},
    }

    _compare_feature_dicts(expected, serializer.to_dict())


def test_base_features_in_list(es):
    value = IdentityFeature(es["log"].ww["value"])
    max_feature = AggregationFeature(value, "sessions", Max)
    features = [max_feature, value]
    serializer = FeaturesSerializer(features)

    expected = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [max_feature.unique_name(), value.unique_name()],
        "feature_definitions": {
            max_feature.unique_name(): max_feature.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    expected["primitive_definitions"] = {
        "0": serialize_primitive(max_feature.primitive),
    }
    expected["feature_definitions"][max_feature.unique_name()]["arguments"][
        "primitive"
    ] = "0"

    actual = serializer.to_dict()
    _compare_feature_dicts(expected, actual)


def test_multi_output_features(es):
    product_id = IdentityFeature(es["log"].ww["product_id"])
    threecommon = NMostCommon()
    num_unique = NumUnique()
    tc = Feature(product_id, parent_dataframe_name="sessions", primitive=threecommon)

    features = [tc, product_id]
    for i in range(3):
        features.append(
            Feature(
                tc[i],
                parent_dataframe_name="customers",
                primitive=num_unique,
            ),
        )
        features.append(tc[i])

    serializer = FeaturesSerializer(features)

    flist = [feat.unique_name() for feat in features]
    fd = [feat.to_dictionary() for feat in features]
    fdict = dict(zip(flist, fd))

    expected = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": flist,
        "feature_definitions": fdict,
    }
    expected["primitive_definitions"] = {
        "0": serialize_primitive(tc.primitive),
        "1": serialize_primitive(features[2].primitive),
    }

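    # flist[0] is the NMostCommon feature; flist[2], flist[4] and flist[6] are
    # the NumUnique features built on its three outputs.
    expected["feature_definitions"][flist[0]]["arguments"]["primitive"] = "0"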
    expected["feature_definitions"][flist[2]]["arguments"]["primitive"] = "1"
    expected["feature_definitions"][flist[4]]["arguments"]["primitive"] = "1"
    expected["feature_definitions"][flist[6]]["arguments"]["primitive"] = "1"

    actual = serializer.to_dict()
    _compare_feature_dicts(expected, actual)


def test_base_features_not_in_list(es):
    max_primitive = Max()
    mult_primitive = MultiplyNumericScalar(value=2)
    value = IdentityFeature(es["log"].ww["value"])
    value_x2 = TransformFeature(value, mult_primitive)
    max_feature = AggregationFeature(value_x2, "sessions", max_primitive)
    features = [max_feature]
    serializer = FeaturesSerializer(features)

    expected = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [max_feature.unique_name()],
        "feature_definitions": {
            max_feature.unique_name(): max_feature.to_dictionary(),
            value_x2.unique_name(): value_x2.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    expected["primitive_definitions"] = {
        "0": serialize_primitive(max_feature.primitive),
        "1": serialize_primitive(value_x2.primitive),
    }
    expected["feature_definitions"][max_feature.unique_name()]["arguments"][
        "primitive"
    ] = "0"
    expected["feature_definitions"][value_x2.unique_name()]["arguments"][
        "primitive"
    ] = "1"

    actual = serializer.to_dict()
    _compare_feature_dicts(expected, actual)


def test_where_feature_dependency(es):
    max_primitive = Max()
    value = IdentityFeature(es["log"].ww["value"])
    is_purchased = IdentityFeature(es["log"].ww["purchased"])
    max_feature = AggregationFeature(
        value,
        "sessions",
        max_primitive,
        where=is_purchased,
    )
    features = [max_feature]
    serializer = FeaturesSerializer(features)

    expected = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [max_feature.unique_name()],
        "feature_definitions": {
            max_feature.unique_name(): max_feature.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
            is_purchased.unique_name(): is_purchased.to_dictionary(),
        },
    }
    expected["primitive_definitions"] = {
        "0": serialize_primitive(max_feature.primitive),
    }
    expected["feature_definitions"][max_feature.unique_name()]["arguments"][
        "primitive"
    ] = "0"

    actual = serializer.to_dict()
    _compare_feature_dicts(expected, actual)


def test_feature_use_previous_pd_timedelta(es):
    value = IdentityFeature(es["log"].ww["id"])
    td = pd.Timedelta(12, "W")
    count_primitive = Count()
    count_feature = AggregationFeature(
        value,
        "customers",
        count_primitive,
        use_previous=td,
    )
    features = [count_feature, value]
    serializer = FeaturesSerializer(features)

    expected = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [count_feature.unique_name(), value.unique_name()],
        "feature_definitions": {
            count_feature.unique_name(): count_feature.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    expected["primitive_definitions"] = {
        "0": serialize_primitive(count_feature.primitive),
    }
    expected["feature_definitions"][count_feature.unique_name()]["arguments"][
        "primitive"
    ] = "0"

    actual = serializer.to_dict()
    _compare_feature_dicts(expected, actual)


def test_feature_use_previous_pd_dateoffset(es):
    value = IdentityFeature(es["log"].ww["id"])
    do = pd.DateOffset(months=3)
    count_primitive = Count()
    count_feature = AggregationFeature(
        value,
        "customers",
        count_primitive,
        use_previous=do,
    )
    features = [count_feature, value]
    serializer = FeaturesSerializer(features)

    expected = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [count_feature.unique_name(), value.unique_name()],
        "feature_definitions": {
            count_feature.unique_name(): count_feature.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    expected["primitive_definitions"] = {
        "0": serialize_primitive(count_feature.primitive),
    }
    expected["feature_definitions"][count_feature.unique_name()]["arguments"][
        "primitive"
    ] = "0"

    actual = serializer.to_dict()
    _compare_feature_dicts(expected, actual)

    value = IdentityFeature(es["log"].ww["id"])
    do = pd.DateOffset(months=3, days=2, minutes=30)
    count_feature = AggregationFeature(
        value,
        "customers",
        count_primitive,
        use_previous=do,
    )
    features = [count_feature, value]
    serializer = FeaturesSerializer(features)

    expected = {
        "ft_version": __version__,
        "schema_version": FEATURES_SCHEMA_VERSION,
        "entityset": es.to_dictionary(),
        "feature_list": [count_feature.unique_name(), value.unique_name()],
        "feature_definitions": {
            count_feature.unique_name(): count_feature.to_dictionary(),
            value.unique_name(): value.to_dictionary(),
        },
    }
    expected["primitive_definitions"] = {
        "0": serialize_primitive(count_feature.primitive),
    }
    expected["feature_definitions"][count_feature.unique_name()]["arguments"][
        "primitive"
    ] = "0"
    actual = serializer.to_dict()
    _compare_feature_dicts(expected, actual)


def _compare_feature_dicts(a_dict, b_dict):
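    # Entityset dictionaries can't be compared directly because column lists
    # are not guaranteed to be in the same order, so rebuild the entitysets and
    # compare those instead. Note that this pops "entityset" from both inputs.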
    es_a = description_to_entityset(a_dict.pop("entityset"))
    es_b = description_to_entityset(b_dict.pop("entityset"))
    assert es_a == es_b

    assert a_dict == b_dict


================================================
FILE: featuretools/tests/primitive_tests/test_groupby_transform_primitives.py
================================================
import numpy as np
import pandas as pd
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools import (
    Feature,
    GroupByTransformFeature,
    IdentityFeature,
    calculate_feature_matrix,
    feature_base,
)
from featuretools.computational_backends.feature_set import FeatureSet
from featuretools.computational_backends.feature_set_calculator import (
    FeatureSetCalculator,
)
from featuretools.primitives import CumCount, CumMax, CumMean, CumMin, CumSum, Last
from featuretools.primitives.base import TransformPrimitive
from featuretools.synthesis import dfs
from featuretools.tests.testing_utils import feature_with_name


class TestCumCount:
    primitive = CumCount

    def test_order(self):
        g = pd.Series(["a", "b", "a"])

        answers = ([1, 2], [1])

        function = self.primitive().get_function()
        for (_, group), answer in zip(g.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)

    def test_regular(self):
        g = pd.Series(["a", "b", "a", "c", "d", "b"])
        answers = ([1, 2], [1, 2], [1], [1])

        function = self.primitive().get_function()
        for (_, group), answer in zip(g.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)

    def test_discrete(self):
        g = pd.Series(["a", "b", "a", "c", "d", "b"])
        answers = ([1, 2], [1, 2], [1], [1])

        function = self.primitive().get_function()
        for (_, group), answer in zip(g.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)


class TestCumSum:
    primitive = CumSum

    def test_order(self):
        v = pd.Series([1, 2, 2])
        g = pd.Series(["a", "b", "a"])

        answers = ([1, 3], [2])

        function = self.primitive().get_function()
        for (_, group), answer in zip(v.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)

    def test_regular(self):
        v = pd.Series([101, 102, 103, 104, 105, 106])
        g = pd.Series(["a", "b", "a", "c", "d", "b"])
        answers = ([101, 204], [102, 208], [104], [105])

        function = self.primitive().get_function()
        for (_, group), answer in zip(v.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)


class TestCumMean:
    primitive = CumMean

    def test_order(self):
        v = pd.Series([1, 2, 2])
        g = pd.Series(["a", "b", "a"])

        answers = ([1, 1.5], [2])

        function = self.primitive().get_function()
        for (_, group), answer in zip(v.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)

    def test_regular(self):
        v = pd.Series([101, 102, 103, 104, 105, 106])
        g = pd.Series(["a", "b", "a", "c", "d", "b"])
        answers = ([101, 102], [102, 104], [104], [105])

        function = self.primitive().get_function()
        for (_, group), answer in zip(v.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)


class TestCumMax:
    primitive = CumMax

    def test_order(self):
        v = pd.Series([1, 2, 2])
        g = pd.Series(["a", "b", "a"])

        answers = ([1, 2], [2])

        function = self.primitive().get_function()
        for (_, group), answer in zip(v.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)

    def test_regular(self):
        v = pd.Series([101, 102, 103, 104, 105, 106])
        g = pd.Series(["a", "b", "a", "c", "d", "b"])
        answers = ([101, 103], [102, 106], [104], [105])

        function = self.primitive().get_function()
        for (_, group), answer in zip(v.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)


class TestCumMin:
    primitive = CumMin

    def test_order(self):
        v = pd.Series([1, 2, 2])
        g = pd.Series(["a", "b", "a"])

        answers = ([1, 1], [2])

        function = self.primitive().get_function()
        for (_, group), answer in zip(v.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)

    def test_regular(self):
        v = pd.Series([101, 102, 103, 104, 105, 106, 100])
        g = pd.Series(["a", "b", "a", "c", "d", "b", "a"])
        answers = ([101, 101, 100], [102, 102], [104], [105])

        function = self.primitive().get_function()
        for (_, group), answer in zip(v.groupby(g), answers):
            np.testing.assert_array_equal(function(group), answer)


def test_cum_sum(es):
    log_value_feat = IdentityFeature(es["log"].ww["value"])
    dfeat = Feature(
        IdentityFeature(es["sessions"].ww["device_type"]),
        dataframe_name="log",
    )
    cum_sum = Feature(log_value_feat, groupby=dfeat, primitive=CumSum)
    features = [cum_sum]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )
    cvalues = df[cum_sum.get_name()].values
    assert len(cvalues) == 15
    cum_sum_values = [0, 5, 15, 30, 50, 0, 1, 3, 6, 6, 50, 55, 55, 62, 76]
    for i, v in enumerate(cum_sum_values):
        assert v == cvalues[i]


def test_cum_min(es):
    log_value_feat = IdentityFeature(es["log"].ww["value"])
    cum_min = Feature(
        log_value_feat,
        groupby=IdentityFeature(es["log"].ww["session_id"]),
        primitive=CumMin,
    )
    features = [cum_min]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )
    cvalues = df[cum_min.get_name()].values
    assert len(cvalues) == 15
    cum_min_values = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    for i, v in enumerate(cum_min_values):
        assert v == cvalues[i]


def test_cum_max(es):
    log_value_feat = IdentityFeature(es["log"].ww["value"])
    cum_max = Feature(
        log_value_feat,
        groupby=IdentityFeature(es["log"].ww["session_id"]),
        primitive=CumMax,
    )
    features = [cum_max]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )
    cvalues = df[cum_max.get_name()].values
    assert len(cvalues) == 15
    cum_max_values = [0, 5, 10, 15, 20, 0, 1, 2, 3, 0, 0, 5, 0, 7, 14]
    for i, v in enumerate(cum_max_values):
        assert v == cvalues[i]


def test_cum_sum_group_on_nan(es):
    log_value_feat = IdentityFeature(es["log"].ww["value"])
    es["log"]["product_id"] = (
        ["coke zero"] * 3
        + ["car"] * 2
        + ["toothpaste"] * 3
        + ["brown bag"] * 2
        + ["shoes"]
        + [np.nan] * 4
        + ["coke_zero"] * 2
    )
    es["log"]["value"][16] = 10
    cum_sum = Feature(
        log_value_feat,
        groupby=IdentityFeature(es["log"].ww["product_id"]),
        primitive=CumSum,
    )
    features = [cum_sum]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(17),
    )
    cvalues = df[cum_sum.get_name()].values
    assert len(cvalues) == 17
    cum_sum_values = [
        0,
        5,
        15,
        15,
        35,
        0,
        1,
        3,
        3,
        3,
        0,
        np.nan,
        np.nan,
        np.nan,
        np.nan,
        np.nan,
        10,
    ]

    assert len(cvalues) == len(cum_sum_values)
    for i, v in enumerate(cum_sum_values):
        if np.isnan(v):
            assert np.isnan(cvalues[i])
        else:
            assert v == cvalues[i]


def test_cum_sum_numpy_group_on_nan(es):
    class CumSumNumpy(TransformPrimitive):
        """Returns the cumulative sum after grouping"""

        name = "cum_sum"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        uses_full_dataframe = True

        def get_function(self):
            def cum_sum(values):
                return values.cumsum().values

            return cum_sum

    log_value_feat = IdentityFeature(es["log"].ww["value"])
    es["log"]["product_id"] = (
        ["coke zero"] * 3
        + ["car"] * 2
        + ["toothpaste"] * 3
        + ["brown bag"] * 2
        + ["shoes"]
        + [np.nan] * 4
        + ["coke_zero"] * 2
    )
    es["log"]["value"][16] = 10
    cum_sum = Feature(
        log_value_feat,
        groupby=IdentityFeature(es["log"].ww["product_id"]),
        primitive=CumSumNumpy,
    )
    assert cum_sum.get_name() == "CUM_SUM(value) by product_id"
    features = [cum_sum]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(17),
    )
    cvalues = df[cum_sum.get_name()].values
    assert len(cvalues) == 17
    cum_sum_values = [
        0,
        5,
        15,
        15,
        35,
        0,
        1,
        3,
        3,
        3,
        0,
        np.nan,
        np.nan,
        np.nan,
        np.nan,
        np.nan,
        10,
    ]

    assert len(cvalues) == len(cum_sum_values)
    for i, v in enumerate(cum_sum_values):
        if np.isnan(v):
            assert np.isnan(cvalues[i])
        else:
            assert v == cvalues[i]


def test_cum_handles_uses_full_dataframe(es):
    def check(feature):
        feature_set = FeatureSet([feature])
        calculator = FeatureSetCalculator(
            es,
            feature_set=feature_set,
            time_last=None,
        )
        df_1 = calculator.run(np.array([0, 1, 2]))
        df_2 = calculator.run(np.array([2, 4]))

        # check that the value for instance id 2 matches
        assert (df_2.loc[2] == df_1.loc[2]).all()

    for primitive in [CumSum, CumMean, CumMax, CumMin]:
        check(
            Feature(
                es["log"].ww["value"],
                groupby=IdentityFeature(es["log"].ww["session_id"]),
                primitive=primitive,
            ),
        )

    check(
        Feature(
            es["log"].ww["product_id"],
            groupby=Feature(es["log"].ww["product_id"]),
            primitive=CumCount,
        ),
    )


def test_cum_mean(es):
    log_value_feat = IdentityFeature(es["log"].ww["value"])
    cum_mean = Feature(
        log_value_feat,
        groupby=IdentityFeature(es["log"].ww["session_id"]),
        primitive=CumMean,
    )
    features = [cum_mean]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )
    cvalues = df[cum_mean.get_name()].values
    assert len(cvalues) == 15
    cum_mean_values = [0, 2.5, 5, 7.5, 10, 0, 0.5, 1, 1.5, 0, 0, 2.5, 0, 3.5, 7]
    for i, v in enumerate(cum_mean_values):
        assert v == cvalues[i]


def test_cum_count(es):
    cum_count = Feature(
        IdentityFeature(es["log"].ww["product_id"]),
        groupby=IdentityFeature(es["log"].ww["product_id"]),
        primitive=CumCount,
    )
    features = [cum_count]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )
    cvalues = df[cum_count.get_name()].values
    assert len(cvalues) == 15
    cum_count_values = [1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 4, 5, 6, 7]
    for i, v in enumerate(cum_count_values):
        assert v == cvalues[i]


def test_rename(es):
    cum_count = Feature(
        IdentityFeature(es["log"].ww["product_id"]),
        groupby=IdentityFeature(es["log"].ww["product_id"]),
        primitive=CumCount,
    )
    copy_feat = cum_count.rename("rename_test")
    assert cum_count.unique_name() != copy_feat.unique_name()
    assert cum_count.get_name() != copy_feat.get_name()
    assert all(
        x.generate_name() == y.generate_name()
        for x, y in zip(cum_count.base_features, copy_feat.base_features)
    )
    assert cum_count.dataframe_name == copy_feat.dataframe_name


def test_groupby_no_data(es):
    cum_count = Feature(
        IdentityFeature(es["log"].ww["product_id"]),
        groupby=IdentityFeature(es["log"].ww["product_id"]),
        primitive=CumCount,
    )
    last_feat = Feature(cum_count, parent_dataframe_name="customers", primitive=Last)
    df = calculate_feature_matrix(
        entityset=es,
        features=[last_feat],
        cutoff_time=pd.Timestamp("2011-04-08"),
    )
    cvalues = df[last_feat.get_name()].values
    assert len(cvalues) == 2
    assert all(pd.isnull(value) for value in cvalues)


def test_groupby_uses_calc_time(es):
    def projected_amount_left(amount, timestamp, time=None):
        # cumulative sum of amount, minus the number of seconds elapsed
        # between each timestamp and the calculation time
        delta = time - timestamp
        delta_seconds = delta / np.timedelta64(1, "s")
        return amount.cumsum() - (delta_seconds)

    class ProjectedAmountRemaining(TransformPrimitive):
        name = "projected_amount_remaining"
        uses_calc_time = True
        input_types = [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(logical_type=Datetime, semantic_tags={"time_index"}),
        ]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        uses_full_dataframe = True

        def get_function(self):
            return projected_amount_left

    time_since_product = GroupByTransformFeature(
        [
            IdentityFeature(es["log"].ww["value"]),
            IdentityFeature(es["log"].ww["datetime"]),
        ],
        groupby=IdentityFeature(es["log"].ww["product_id"]),
        primitive=ProjectedAmountRemaining,
    )
    df = calculate_feature_matrix(
        entityset=es,
        features=[time_since_product],
        cutoff_time=pd.Timestamp("2011-04-10 11:10:30"),
    )
    answers = [
        -88830,
        -88819,
        -88803,
        -88797,
        -88771,
        -88770,
        -88760,
        -88749,
        -88740,
        -88227,
        -1830,
        -1809,
        -1750,
        -1740,
        -1723,
        np.nan,
        np.nan,
    ]

    for x, y in zip(df[time_since_product.get_name()], answers):
        assert (pd.isnull(x) and pd.isnull(y)) or x == y


def test_groupby_multi_output_stacking(es):
    class TestTime(TransformPrimitive):
        name = "test_time"
        input_types = [ColumnSchema(logical_type=Datetime)]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 6

    fl = dfs(
        entityset=es,
        target_dataframe_name="sessions",
        agg_primitives=["sum"],
        groupby_trans_primitives=[TestTime],
        features_only=True,
        max_depth=4,
    )

    for i in range(6):
        f = "SUM(log.TEST_TIME(datetime)[%d] by product_id)" % i
        assert feature_with_name(fl, f)
        assert ("customers.SUM(log.TEST_TIME(datetime)[%d] by session_id)" % i) in fl


def test_serialization(es):
    value = IdentityFeature(es["log"].ww["value"])
    zipcode = IdentityFeature(es["log"].ww["zipcode"])
    primitive = CumSum()
    groupby = feature_base.GroupByTransformFeature(value, primitive, zipcode)

    dictionary = {
        "name": "CUM_SUM(value) by zipcode",
        "base_features": [value.unique_name()],
        "primitive": primitive,
        "groupby": zipcode.unique_name(),
    }

    assert dictionary == groupby.get_arguments()
    dependencies = {
        value.unique_name(): value,
        zipcode.unique_name(): zipcode,
    }
    assert groupby == feature_base.GroupByTransformFeature.from_dictionary(
        dictionary,
        es,
        dependencies,
        primitive,
    )


def test_groupby_with_multioutput_primitive(es):
    class MultiCumSum(TransformPrimitive):
        name = "multi_cum_sum"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 3

        def get_function(self):
            def multi_cum_sum(x):
                return x.cumsum(), x.cummax(), x.cummin()

            return multi_cum_sum

    fm, _ = dfs(
        entityset=es,
        target_dataframe_name="customers",
        trans_primitives=[],
        agg_primitives=[],
        groupby_trans_primitives=[MultiCumSum, CumSum, CumMax, CumMin],
    )

    # Calculate output in a separate DFS call to make sure the multi-output code
    # does not alter any values
    fm2, _ = dfs(
        entityset=es,
        target_dataframe_name="customers",
        trans_primitives=[],
        agg_primitives=[],
        groupby_trans_primitives=[CumSum, CumMax, CumMin],
    )

    answer_cols = [
        ["CUM_SUM(age) by cohort", "CUM_SUM(age) by région_id"],
        ["CUM_MAX(age) by cohort", "CUM_MAX(age) by région_id"],
        ["CUM_MIN(age) by cohort", "CUM_MIN(age) by région_id"],
    ]

    for i in range(3):
        # Check that multi-output gives correct answers
        f = "MULTI_CUM_SUM(age)[%d] by cohort" % i
        assert f in fm.columns
        for x, y in zip(fm[f].values, fm[answer_cols[i][0]].values):
            assert x == y
        f = "MULTI_CUM_SUM(age)[%d] by région_id" % i
        assert f in fm.columns
        for x, y in zip(fm[f].values, fm[answer_cols[i][1]].values):
            assert x == y
        # Verify single output results are unchanged by inclusion of
        # multi-output primitive
        for x, y in zip(fm[answer_cols[i][0]], fm2[answer_cols[i][0]]):
            assert x == y
        for x, y in zip(fm[answer_cols[i][1]], fm2[answer_cols[i][1]]):
            assert x == y


def test_groupby_with_multioutput_primitive_custom_names(es):
    class MultiCumSum(TransformPrimitive):
        name = "multi_cum_sum"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 3

        def get_function(self):
            def multi_cum_sum(x):
                return x.cumsum(), x.cummax(), x.cummin()

            return multi_cum_sum

        def generate_names(self, base_feature_names):
            return ["CUSTOM_SUM", "CUSTOM_MAX", "CUSTOM_MIN"]

    fm, _ = dfs(
        entityset=es,
        target_dataframe_name="customers",
        trans_primitives=[],
        agg_primitives=[],
        groupby_trans_primitives=[MultiCumSum, CumSum, CumMax, CumMin],
    )

    answer_cols = [
        ["CUM_SUM(age) by cohort", "CUM_SUM(age) by région_id"],
        ["CUM_MAX(age) by cohort", "CUM_MAX(age) by région_id"],
        ["CUM_MIN(age) by cohort", "CUM_MIN(age) by région_id"],
    ]

    expected_names = [
        ["CUSTOM_SUM by cohort", "CUSTOM_SUM by région_id"],
        ["CUSTOM_MAX by cohort", "CUSTOM_MAX by région_id"],
        ["CUSTOM_MIN by cohort", "CUSTOM_MIN by région_id"],
    ]

    for i in range(3):
        f = expected_names[i][0]
        assert f in fm.columns
        for x, y in zip(fm[f].values, fm[answer_cols[i][0]].values):
            assert x == y
        f = expected_names[i][1]
        assert f in fm.columns
        for x, y in zip(fm[f].values, fm[answer_cols[i][1]].values):
            assert x == y


================================================
FILE: featuretools/tests/primitive_tests/test_identity_features.py
================================================
from featuretools import IdentityFeature
from featuretools.primitives.utils import PrimitivesDeserializer


def test_relationship_path(es):
    value = IdentityFeature(es["log"].ww["value"])
    assert len(value.relationship_path) == 0


def test_serialization(es):
    value = IdentityFeature(es["log"].ww["value"])

    dictionary = {
        "name": "value",
        "column_name": "value",
        "dataframe_name": "log",
    }

    assert dictionary == value.get_arguments()
    assert value == IdentityFeature.from_dictionary(
        dictionary,
        es,
        {},
        PrimitivesDeserializer,
    )


================================================
FILE: featuretools/tests/primitive_tests/test_overrides.py
================================================
from featuretools import Feature, calculate_feature_matrix
from featuretools.primitives import (
    AddNumeric,
    AddNumericScalar,
    Count,
    DivideByFeature,
    DivideNumeric,
    DivideNumericScalar,
    Equal,
    EqualScalar,
    GreaterThan,
    GreaterThanEqualTo,
    GreaterThanEqualToScalar,
    GreaterThanScalar,
    LessThan,
    LessThanEqualTo,
    LessThanEqualToScalar,
    LessThanScalar,
    ModuloByFeature,
    ModuloNumeric,
    ModuloNumericScalar,
    MultiplyNumeric,
    MultiplyNumericScalar,
    Negate,
    NotEqual,
    NotEqualScalar,
    ScalarSubtractNumericFeature,
    SubtractNumeric,
    SubtractNumericScalar,
    Sum,
)


def test_overrides(es):
    value = Feature(es["log"].ww["value"])
    value2 = Feature(es["log"].ww["value_2"])

    feats = [
        AddNumeric,
        SubtractNumeric,
        MultiplyNumeric,
        DivideNumeric,
        ModuloNumeric,
        GreaterThan,
        LessThan,
        Equal,
        NotEqual,
        GreaterThanEqualTo,
        LessThanEqualTo,
    ]
    assert Feature(value, primitive=Negate).unique_name() == (-value).unique_name()

    compares = [(value, value), (value, value2)]
    overrides = [
        value + value,
        value - value,
        value * value,
        value / value,
        value % value,
        value > value,
        value < value,
        value == value,
        value != value,
        value >= value,
        value <= value,
        value + value2,
        value - value2,
        value * value2,
        value / value2,
        value % value2,
        value > value2,
        value < value2,
        value == value2,
        value != value2,
        value >= value2,
        value <= value2,
    ]

    for left, right in compares:
        for feat in feats:
            f = Feature([left, right], primitive=feat)
            o = overrides.pop(0)
            assert o.unique_name() == f.unique_name()


def test_override_boolean(es):
    count = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    count_lo = Feature(count, primitive=GreaterThanScalar(1))
    count_hi = Feature(count, primitive=LessThanScalar(10))

    to_test = [[True, True, True], [True, True, False], [False, False, True]]

    features = []
    features.append(count_lo.OR(count_hi))
    features.append(count_lo.AND(count_hi))
    features.append(~(count_lo.AND(count_hi)))

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=[0, 1, 2],
    )
    for i, test in enumerate(to_test):
        v = df[features[i].get_name()].tolist()
        assert v == test


def test_scalar_overrides(es):
    value = Feature(es["log"].ww["value"])

    feats = [
        AddNumericScalar,
        SubtractNumericScalar,
        MultiplyNumericScalar,
        DivideNumericScalar,
        ModuloNumericScalar,
        GreaterThanScalar,
        LessThanScalar,
        EqualScalar,
        NotEqualScalar,
        GreaterThanEqualToScalar,
        LessThanEqualToScalar,
    ]

    overrides = [
        value + 2,
        value - 2,
        value * 2,
        value / 2,
        value % 2,
        value > 2,
        value < 2,
        value == 2,
        value != 2,
        value >= 2,
        value <= 2,
    ]

    for feat in feats:
        f = Feature(value, primitive=feat(2))
        o = overrides.pop(0)
        assert o.unique_name() == f.unique_name()

    value2 = Feature(es["log"].ww["value_2"])

    reverse_feats = [
        AddNumericScalar,
        ScalarSubtractNumericFeature,
        MultiplyNumericScalar,
        DivideByFeature,
        ModuloByFeature,
        GreaterThanScalar,
        LessThanScalar,
        EqualScalar,
        NotEqualScalar,
        GreaterThanEqualToScalar,
        LessThanEqualToScalar,
    ]
    reverse_overrides = [
        2 + value2,
        2 - value2,
        2 * value2,
        2 / value2,
        2 % value2,
        2 < value2,
        2 > value2,
        2 == value2,
        2 != value2,
        2 <= value2,
        2 >= value2,
    ]
    for feat in reverse_feats:
        f = Feature(value2, primitive=feat(2))
        o = reverse_overrides.pop(0)
        assert o.unique_name() == f.unique_name()


def test_override_cmp_from_column(es):
    count_lo = Feature(es["log"].ww["value"]) > 1

    to_test = [False, True, True]

    features = [count_lo]

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=[0, 1, 2],
    )
    v = df[count_lo.get_name()].tolist()
    for i, test in enumerate(to_test):
        assert v[i] == test


def test_override_cmp(es):
    count = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    _sum = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        primitive=Sum,
    )
    gt_lo = count > 1
    gt_other = count > _sum
    ge_lo = count >= 1
    ge_other = count >= _sum
    lt_hi = count < 10
    lt_other = count < _sum
    le_hi = count <= 10
    le_other = count <= _sum
    ne_lo = count != 1
    ne_other = count != _sum

    to_test = [
        [True, True, False],
        [False, False, True],
        [True, True, True],
        [False, False, True],
        [True, True, True],
        [True, True, False],
        [True, True, True],
        [True, True, False],
    ]
    features = [
        gt_lo,
        gt_other,
        ge_lo,
        ge_other,
        lt_hi,
        lt_other,
        le_hi,
        le_other,
        ne_lo,
        ne_other,
    ]

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=[0, 1, 2],
    )
    for i, test in enumerate(to_test):
        v = df[features[i].get_name()].tolist()
        assert v == test


================================================
FILE: featuretools/tests/primitive_tests/test_primitive_base.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
from pytest import raises

from featuretools.primitives import Haversine, IsIn, IsNull, Max, TimeSinceLast
from featuretools.primitives.base import TransformPrimitive


def test_call_agg():
    primitive = Max()

    # The assert is run twice on purpose: the second call reuses the primitive instance
    for _ in range(2):
        assert 5 == primitive(range(6))


def test_call_trans():
    primitive = IsNull()
    for _ in range(2):
        assert pd.Series([False] * 6).equals(primitive(range(6)))


def test_uses_calc_time():
    primitive = TimeSinceLast()
    primitive_h = TimeSinceLast(unit="hours")
    datetimes = pd.Series([datetime(2015, 6, 6), datetime(2015, 6, 7)])
    answer = 86400.0
    answer_h = 24.0
    assert answer == primitive(datetimes, time=datetime(2015, 6, 8))
    assert answer_h == primitive_h(datetimes, time=datetime(2015, 6, 8))


def test_call_multiple_args():
    primitive = Haversine()
    data1 = [(42.4, -71.1), (40.0, -122.4)]
    data2 = [(40.0, -122.4), (41.2, -96.75)]
    answer = [2631.231, 1343.289]

    for _ in range(2):
        assert np.round(primitive(data1, data2), 3).tolist() == answer


def test_get_function_called_once():
    class TestPrimitive(TransformPrimitive):
        def __init__(self):
            self.get_function_call_count = 0

        def get_function(self):
            self.get_function_call_count += 1

            def test(x):
                return x

            return test

    primitive = TestPrimitive()

    for _ in range(2):
        primitive(range(6))

    assert primitive.get_function_call_count == 1


def test_multiple_arg_string():
    class Primitive(TransformPrimitive):
        def __init__(self, bool=True, int=0, float=None):
            self.bool = bool
            self.int = int
            self.float = float

    primitive = Primitive(bool=True, int=4, float=0.1)
    string = primitive.get_args_string()
    assert string == ", int=4, float=0.1"


def test_single_args_string():
    assert IsIn([1, 2, 3]).get_args_string() == ", list_of_outputs=[1, 2, 3]"


def test_args_string_default():
    assert IsIn().get_args_string() == ""


def test_args_string_mixed():
    class Primitive(TransformPrimitive):
        def __init__(self, bool=True, int=0, float=None):
            self.bool = bool
            self.int = int
            self.float = float

    primitive = Primitive(bool=False, int=0)
    string = primitive.get_args_string()
    assert string == ", bool=False"


def test_args_string_undefined():
    string = Max().get_args_string()
    assert string == ""


def test_args_string_error():
    class Primitive(TransformPrimitive):
        def __init__(self, bool=True, int=0, float=None):
            pass

    with raises(AssertionError, match="must be attribute"):
        Primitive(bool=True, int=4, float=0.1).get_args_string()
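

# Illustrative sketch only; this is NOT the featuretools implementation of
# get_args_string. It is a standalone helper with a hypothetical name that
# reproduces the contract the tests above pin down: only keyword arguments whose
# stored value differs from the __init__ default are rendered, and every
# __init__ keyword must be stored as an attribute of the same name.
import inspect


def sketch_args_string(obj):
    parts = []
    parameters = inspect.signature(type(obj).__init__).parameters
    for name, param in parameters.items():
        # Skip `self` and any parameter without a default value
        if name == "self" or param.default is inspect.Parameter.empty:
            continue
        assert hasattr(obj, name), "%s must be attribute of the object" % name
        value = getattr(obj, name)
        if value != param.default:
            parts.append(", %s=%s" % (name, value))
    return "".join(parts)


def test_sketch_args_string():
    class Example:
        def __init__(self, bool=True, int=0, float=None):
            self.bool = bool
            self.int = int
            self.float = float

    assert sketch_args_string(Example(bool=True, int=4, float=0.1)) == ", int=4, float=0.1"
    assert sketch_args_string(Example(bool=False, int=0)) == ", bool=False"
    assert sketch_args_string(Example()) == ""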


================================================
FILE: featuretools/tests/primitive_tests/test_primitive_utils.py
================================================
import os

import pytest

from featuretools import list_primitives, summarize_primitives
from featuretools.primitives import (
    AddNumericScalar,
    Age,
    Count,
    Day,
    Diff,
    GreaterThan,
    Haversine,
    IsFreeEmailDomain,
    IsNull,
    Last,
    Max,
    Mean,
    Min,
    Mode,
    Month,
    MultiplyBoolean,
    NMostCommon,
    NumCharacters,
    NumericLag,
    NumUnique,
    NumWords,
    PercentTrue,
    Skew,
    Std,
    Sum,
    Weekday,
    Year,
    get_aggregation_primitives,
    get_default_aggregation_primitives,
    get_default_transform_primitives,
    get_transform_primitives,
)
from featuretools.primitives.base import PrimitiveBase
from featuretools.primitives.base.transform_primitive_base import TransformPrimitive
from featuretools.primitives.utils import (
    _check_input_types,
    _get_descriptions,
    _get_summary_primitives,
    _get_unique_input_types,
    list_primitive_files,
    load_primitive_from_file,
)


def test_list_primitives_order():
    df = list_primitives()
    all_primitives = get_transform_primitives()
    all_primitives.update(get_aggregation_primitives())

    for name, primitive in all_primitives.items():
        assert name in df["name"].values
        row = df.loc[df["name"] == name].iloc[0]
        actual_desc = _get_descriptions([primitive])[0]
        if actual_desc:
            assert actual_desc == row["description"]
        assert row["valid_inputs"] == ", ".join(
            _get_unique_input_types(primitive.input_types),
        )
        expected_return_type = (
            str(primitive.return_type) if primitive.return_type is not None else None
        )
        assert row["return_type"] == expected_return_type

    types = df["type"].values
    assert "aggregation" in types
    assert "transform" in types


def test_valid_input_types():
    actual = _get_unique_input_types(Haversine.input_types)
    assert actual == {"<ColumnSchema (Logical Type = LatLong)>"}
    actual = _get_unique_input_types(MultiplyBoolean.input_types)
    assert actual == {
        "<ColumnSchema (Logical Type = Boolean)>",
        "<ColumnSchema (Logical Type = BooleanNullable)>",
    }
    actual = _get_unique_input_types(Sum.input_types)
    assert actual == {"<ColumnSchema (Semantic Tags = ['numeric'])>"}


def test_descriptions():
    primitives = {
        NumCharacters: "Calculates the number of characters in a given string, including whitespace and punctuation.",
        Day: "Determines the day of the month from a datetime.",
        Last: "Determines the last value in a list.",
        GreaterThan: "Determines if values in one list are greater than another list.",
    }
    assert _get_descriptions(list(primitives.keys())) == list(primitives.values())


def test_get_descriptions_doesnt_truncate_primitive_description():
    # single line
    descr = _get_descriptions([IsNull])
    assert descr[0] == "Determines if a value is null."

    # multiple line; one sentence
    descr = _get_descriptions([Diff])
    assert (
        descr[0]
        == "Computes the difference between the value in a list and the previous value in that list."
    )

    # multiple lines; multiple sentences
    class TestPrimitive(TransformPrimitive):
        """This is text that continues on after the line break
            and ends in a period.
            This is text on one line without a period

        Examples:
            >>> absolute = Absolute()
            >>> absolute([3.0, -5.0, -2.4]).tolist()
            [3.0, 5.0, 2.4]
        """

        name = "test_primitive"

    descr = _get_descriptions([TestPrimitive])
    assert (
        descr[0]
        == "This is text that continues on after the line break and ends in a period. This is text on one line without a period"
    )

    # docstring ends after description
    class TestPrimitive2(TransformPrimitive):
        """This is text that continues on after the line break
        and ends in a period.
        This is text on one line without a period
        """

        name = "test_primitive"

    descr = _get_descriptions([TestPrimitive2])
    assert (
        descr[0]
        == "This is text that continues on after the line break and ends in a period. This is text on one line without a period"
    )


def test_get_default_aggregation_primitives():
    primitives = get_default_aggregation_primitives()
    expected_primitives = [
        Sum,
        Std,
        Max,
        Skew,
        Min,
        Mean,
        Count,
        PercentTrue,
        NumUnique,
        Mode,
    ]
    assert set(primitives) == set(expected_primitives)


def test_get_default_transform_primitives():
    primitives = get_default_transform_primitives()
    expected_primitives = [
        Age,
        Day,
        Year,
        Month,
        Weekday,
        Haversine,
        NumWords,
        NumCharacters,
    ]
    assert set(primitives) == set(expected_primitives)


@pytest.fixture
def this_dir():
    return os.path.dirname(os.path.abspath(__file__))


@pytest.fixture
def primitives_to_install_dir(this_dir):
    return os.path.join(this_dir, "primitives_to_install")


@pytest.fixture
def bad_primitives_files_dir(this_dir):
    return os.path.join(this_dir, "bad_primitive_files")


def test_list_primitive_files(primitives_to_install_dir):
    files = list_primitive_files(primitives_to_install_dir)
    custom_max_file = os.path.join(primitives_to_install_dir, "custom_max.py")
    custom_mean_file = os.path.join(primitives_to_install_dir, "custom_mean.py")
    custom_sum_file = os.path.join(primitives_to_install_dir, "custom_sum.py")
    assert {custom_max_file, custom_mean_file, custom_sum_file}.issubset(set(files))


def test_load_primitive_from_file(primitives_to_install_dir):
    primitive_file = os.path.join(primitives_to_install_dir, "custom_max.py")
    primitive_name, primitive_obj = load_primitive_from_file(primitive_file)
    assert issubclass(primitive_obj, PrimitiveBase)


def test_errors_more_than_one_primitive_in_file(bad_primitives_files_dir):
    primitive_file = os.path.join(bad_primitives_files_dir, "multiple_primitives.py")
    error_text = "More than one primitive defined in file {}".format(primitive_file)
    with pytest.raises(RuntimeError) as excinfo:
        load_primitive_from_file(primitive_file)
    assert str(excinfo.value) == error_text


def test_errors_no_primitive_in_file(bad_primitives_files_dir):
    primitive_file = os.path.join(bad_primitives_files_dir, "no_primitives.py")
    error_text = "No primitive defined in file {}".format(primitive_file)
    with pytest.raises(RuntimeError) as excinfo:
        load_primitive_from_file(primitive_file)
    assert str(excinfo.value) == error_text


def test_check_input_types():
    primitives = [Sum, Weekday, PercentTrue, Day, Std, NumericLag]
    log_in_type_checks = set()
    sem_tag_type_checks = set()
    unique_input_types = set()
    expected_log_in_check = {
        "boolean_nullable",
        "boolean",
        "datetime",
    }
    expected_sem_tag_type_check = {"numeric", "time_index"}
    expected_unique_input_types = {
        "<ColumnSchema (Logical Type = BooleanNullable)>",
        "<ColumnSchema (Semantic Tags = ['numeric'])>",
        "<ColumnSchema (Logical Type = Boolean)>",
        "<ColumnSchema (Logical Type = Datetime)>",
        "<ColumnSchema (Semantic Tags = ['time_index'])>",
    }
    for prim in primitives:
        input_types_flattened = prim.flatten_nested_input_types(prim.input_types)
        _check_input_types(
            input_types_flattened,
            log_in_type_checks,
            sem_tag_type_checks,
            unique_input_types,
        )

    assert log_in_type_checks == expected_log_in_check
    assert sem_tag_type_checks == expected_sem_tag_type_check
    assert unique_input_types == expected_unique_input_types


def test_get_summary_primitives():
    primitives = [
        Sum,
        Weekday,
        PercentTrue,
        Day,
        Std,
        NumericLag,
        AddNumericScalar,
        IsFreeEmailDomain,
        NMostCommon,
    ]
    primitives_summary = _get_summary_primitives(primitives)
    expected_unique_input_types = 7
    expected_unique_output_types = 6
    expected_uses_multi_input = 2
    expected_uses_multi_output = 1
    expected_uses_external_data = 1
    expected_controllable = 3
    expected_datetime_inputs = 2
    expected_bool = 1
    expected_bool_nullable = 1
    expected_time_index_tag = 1

    assert (
        primitives_summary["general_metrics"]["unique_input_types"]
        == expected_unique_input_types
    )
    assert (
        primitives_summary["general_metrics"]["unique_output_types"]
        == expected_unique_output_types
    )
    assert (
        primitives_summary["general_metrics"]["uses_multi_input"]
        == expected_uses_multi_input
    )
    assert (
        primitives_summary["general_metrics"]["uses_multi_output"]
        == expected_uses_multi_output
    )
    assert (
        primitives_summary["general_metrics"]["uses_external_data"]
        == expected_uses_external_data
    )
    assert (
        primitives_summary["general_metrics"]["are_controllable"]
        == expected_controllable
    )
    assert (
        primitives_summary["semantic_tag_metrics"]["time_index"]
        == expected_time_index_tag
    )
    assert (
        primitives_summary["logical_type_input_metrics"]["datetime"]
        == expected_datetime_inputs
    )
    assert primitives_summary["logical_type_input_metrics"]["boolean"] == expected_bool
    assert (
        primitives_summary["logical_type_input_metrics"]["boolean_nullable"]
        == expected_bool_nullable
    )


def test_summarize_primitives():
    df = summarize_primitives()
    trans_prims = get_transform_primitives()
    agg_prims = get_aggregation_primitives()
    tot_trans = len(trans_prims)
    tot_agg = len(agg_prims)
    tot_prims = tot_trans + tot_agg

    assert df["Count"].iloc[0] == tot_prims
    assert df["Count"].iloc[1] == tot_agg
    assert df["Count"].iloc[2] == tot_trans


================================================
FILE: featuretools/tests/primitive_tests/test_rolling_primitive_utils.py
================================================
from unittest.mock import patch

import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import (
    RollingCount,
    RollingMax,
    RollingMean,
    RollingMin,
    RollingSTD,
    RollingTrend,
)
from featuretools.primitives.standard.transform.time_series.utils import (
    _get_rolled_series_without_gap,
    apply_roll_with_offset_gap,
    roll_series_with_gap,
)
from featuretools.tests.primitive_tests.utils import get_number_from_offset


def test_get_rolled_series_without_gap(window_series):
    # Data is daily, so number of rows should be number of days not included in the gap
    assert len(_get_rolled_series_without_gap(window_series, "11D")) == 9
    assert len(_get_rolled_series_without_gap(window_series, "0D")) == 20
    assert len(_get_rolled_series_without_gap(window_series, "48H")) == 18
    assert len(_get_rolled_series_without_gap(window_series, "4H")) == 19
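

# Illustrative sketch only; this is NOT the featuretools implementation of
# _get_rolled_series_without_gap. It is a pandas-only helper with a hypothetical
# name that is consistent with the row counts asserted above: rows that fall
# within `gap_offset` of the window's last timestamp are dropped.
import pandas as pd  # repeated here so the sketch is self-contained


def sketch_drop_gap_rows(window, gap_offset):
    if len(window) == 0:
        return window
    # Latest timestamp that is still allowed once the gap is excluded
    gap_bound = window.index[-1] - pd.Timedelta(gap_offset)
    return window[window.index <= gap_bound]


def test_sketch_drop_gap_rows():
    # Assumed stand-in for the `window_series` fixture: 20 daily observations
    daily = pd.Series(
        range(20),
        index=pd.date_range("2020-01-01", periods=20, freq="D"),
    )
    assert len(sketch_drop_gap_rows(daily, "11D")) == 9
    assert len(sketch_drop_gap_rows(daily, "0D")) == 20
    # Non-uniform spacing: only rows on or before day 5 survive a 4 day gap
    assert len(sketch_drop_gap_rows(daily.iloc[[0, 2, 5, 6, 8, 9]], "4D")) == 3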


def test_get_rolled_series_without_gap_not_uniform(window_series):
    non_uniform_series = window_series.iloc[[0, 2, 5, 6, 8, 9]]

    assert len(_get_rolled_series_without_gap(non_uniform_series, "10D")) == 0
    assert len(_get_rolled_series_without_gap(non_uniform_series, "0D")) == 6
    assert len(_get_rolled_series_without_gap(non_uniform_series, "48H")) == 4
    assert len(_get_rolled_series_without_gap(non_uniform_series, "4H")) == 5
    assert len(_get_rolled_series_without_gap(non_uniform_series, "4D")) == 3
    assert len(_get_rolled_series_without_gap(non_uniform_series, "4D2H")) == 2


def test_get_rolled_series_without_gap_empty_series(window_series):
    empty_series = pd.Series([], dtype="object")
    assert len(_get_rolled_series_without_gap(empty_series, "1D")) == 0
    assert len(_get_rolled_series_without_gap(empty_series, "0D")) == 0


def test_get_rolled_series_without_gap_large_bound(window_series):
    assert len(_get_rolled_series_without_gap(window_series, "100D")) == 0
    assert (
        len(
            _get_rolled_series_without_gap(
                window_series.iloc[[0, 2, 5, 6, 8, 9]],
                "20D",
            ),
        )
        == 0
    )


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (3, 2),
        (3, 4),  # gap larger than window
        (2, 0),  # gap explicitly set to 0
        ("3d", "2d"),  # using offset aliases
        ("3d", "4d"),
        ("4d", "0d"),
    ],
)
def test_roll_series_with_gap(window_length, gap, window_series):
    rolling_max = roll_series_with_gap(
        window_series,
        window_length,
        gap=gap,
        min_periods=1,
    ).max()
    rolling_min = roll_series_with_gap(
        window_series,
        window_length,
        gap=gap,
        min_periods=1,
    ).min()

    assert len(rolling_max) == len(window_series)
    assert len(rolling_min) == len(window_series)

    gap_num = get_number_from_offset(gap)
    window_length_num = get_number_from_offset(window_length)
    for i in range(len(window_series)):
        start_idx = i - gap_num - window_length_num + 1

        if isinstance(gap, str):
            # No gap functionality is happening for offset gaps here, so the gap isn't
            # taken into account in the end index; it's as if the gap were 0 and the
            # window includes the row itself
            end_idx = i
        else:
            end_idx = i - gap_num

        # If start and end are both negative, the window lies entirely before the start of the series
        if start_idx < 0 and end_idx < 0:
            assert pd.isnull(rolling_max.iloc[i])
            assert pd.isnull(rolling_min.iloc[i])
            continue

        if start_idx < 0:
            start_idx = 0

        # Because the row values are a range from 0 to 20, the rolling min will be the start index
        # and the rolling max will be the end idx
        assert rolling_min.iloc[i] == start_idx
        assert rolling_max.iloc[i] == end_idx


@pytest.mark.parametrize("window_length", [3, "3d"])
def test_roll_series_with_no_gap(window_length, window_series):
    actual_rolling = roll_series_with_gap(
        window_series,
        window_length,
        gap=0,
        min_periods=1,
    ).mean()
    expected_rolling = window_series.rolling(window_length, min_periods=1).mean()

    pd.testing.assert_series_equal(actual_rolling, expected_rolling)
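

# Illustrative sketch only; this is NOT the featuretools implementation of
# roll_series_with_gap. It is a pandas-only helper with a hypothetical name that
# shows one way to get the integer-gap behavior these tests check: shift the
# series by `gap` rows, then roll over `window_length` rows.
import pandas as pd  # repeated here so the sketch is self-contained


def sketch_roll_with_integer_gap(series, window_length, gap, min_periods=1):
    # Shifting inserts `gap` leading nans, so each window ends `gap` rows early
    return series.shift(gap).rolling(window_length, min_periods=min_periods)


def test_sketch_roll_with_integer_gap():
    # Assumed stand-in for the `window_series` fixture: values 0..19, daily index
    daily = pd.Series(
        range(20),
        index=pd.date_range("2020-01-01", periods=20, freq="D"),
    )
    rolled_max = sketch_roll_with_integer_gap(daily, 3, 2).max()
    rolled_min = sketch_roll_with_integer_gap(daily, 3, 2).min()
    # The first `gap` rows have nothing to aggregate
    assert rolled_max.isna().sum() == 2
    # Row i aggregates original rows i - 4 through i - 2
    assert rolled_max.iloc[10] == 8
    assert rolled_min.iloc[10] == 6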


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (6, 2),
        (6, 0),  # No gap - changes early values
        ("6d", "0d"),  # Uses offset aliases
        ("6d", "2d"),
    ],
)
def test_roll_series_with_gap_early_values(window_length, gap, window_series):
    gap_num = get_number_from_offset(gap)
    window_length_num = get_number_from_offset(window_length)

    # Default min periods is 1 - will include all
    default_partial_values = roll_series_with_gap(
        window_series,
        window_length,
        gap=gap,
        min_periods=1,
    ).count()
    num_empty_aggregates = len(default_partial_values.loc[default_partial_values == 0])
    num_partial_aggregates = len(
        (default_partial_values.loc[default_partial_values != 0]).loc[
            default_partial_values < window_length_num
        ],
    )

    assert num_partial_aggregates == window_length_num - 1
    if isinstance(gap, str):
        # gap isn't handled, so we'll always at least include the row itself
        assert num_empty_aggregates == 0
    else:
        assert num_empty_aggregates == gap_num

    # Make min periods the size of the window
    no_partial_values = roll_series_with_gap(
        window_series,
        window_length,
        gap=gap,
        min_periods=window_length_num,
    ).count()
    num_null_aggregates = len(no_partial_values.loc[pd.isna(no_partial_values)])
    num_partial_aggregates = len(
        no_partial_values.loc[no_partial_values < window_length_num],
    )

    # Because we shift, the gap is included as nan values in the series.
    # Count treats nans in a window as values that don't get counted,
    # so the gap rows are included when checking whether a window has "min periods".
    # This differs from max, for example, which does not count nans in a window as values towards "min periods"
    assert num_null_aggregates == window_length_num - 1
    if isinstance(gap, str):
        # gap isn't handled, so we'll never have any partial aggregates
        assert num_partial_aggregates == 0
    else:
        assert num_partial_aggregates == gap_num


def test_roll_series_with_gap_nullable_types(window_series):
    window_length = 3
    gap = 2
    min_periods = 1
    # Because we're inserting nans, confirm that nullability of the dtype doesn't have an impact on the results
    nullable_series = window_series.astype("Int64")
    non_nullable_series = window_series.astype("int64")

    nullable_rolling_max = roll_series_with_gap(
        nullable_series,
        window_length,
        gap=gap,
        min_periods=min_periods,
    ).max()
    non_nullable_rolling_max = roll_series_with_gap(
        non_nullable_series,
        window_length,
        gap=gap,
        min_periods=min_periods,
    ).max()

    pd.testing.assert_series_equal(nullable_rolling_max, non_nullable_rolling_max)


def test_roll_series_with_gap_nullable_types_with_nans(window_series):
    window_length = 3
    gap = 2
    min_periods = 1
    nullable_floats = window_series.astype("float64").replace(
        {1: np.nan, 3: np.nan},
    )
    nullable_ints = nullable_floats.astype("Int64")

    nullable_ints_rolling_max = roll_series_with_gap(
        nullable_ints,
        window_length,
        gap=gap,
        min_periods=min_periods,
    ).max()
    nullable_floats_rolling_max = roll_series_with_gap(
        nullable_floats,
        window_length,
        gap=gap,
        min_periods=min_periods,
    ).max()

    pd.testing.assert_series_equal(
        nullable_ints_rolling_max,
        nullable_floats_rolling_max,
    )

    expected_early_values = [np.nan, np.nan, 0, 0, 2, 2, 4] + list(
        range(7 - gap, len(window_series) - gap),
    )
    for i in range(len(window_series)):
        actual = nullable_floats_rolling_max.iloc[i]
        expected = expected_early_values[i]

        if pd.isnull(actual):
            assert pd.isnull(expected)
        else:
            assert actual == expected


@pytest.mark.parametrize(
    "window_length, gap",
    [
        ("3d", "2d"),
        ("3d", "4d"),
        ("4d", "0d"),
    ],
)
def test_apply_roll_with_offset_gap(window_length, gap, window_series):
    def max_wrapper(sub_s):
        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=1)

    rolling_max_obj = roll_series_with_gap(
        window_series,
        window_length,
        gap=gap,
        min_periods=1,
    )
    rolling_max_series = rolling_max_obj.apply(max_wrapper)

    def min_wrapper(sub_s):
        return apply_roll_with_offset_gap(sub_s, gap, min, min_periods=1)

    rolling_min_obj = roll_series_with_gap(
        window_series,
        window_length,
        gap=gap,
        min_periods=1,
    )
    rolling_min_series = rolling_min_obj.apply(min_wrapper)

    assert len(rolling_max_series) == len(window_series)
    assert len(rolling_min_series) == len(window_series)

    gap_num = get_number_from_offset(gap)
    window_length_num = get_number_from_offset(window_length)
    for i in range(len(window_series)):
        start_idx = i - gap_num - window_length_num + 1
        # Because apply_roll_with_offset_gap removes the gap rows, the window ends at i - gap_num
        end_idx = i - gap_num

        # If both start and end are negative, the window lies entirely before the start of the series
        if start_idx < 0 and end_idx < 0:
            assert pd.isnull(rolling_max_series.iloc[i])
            assert pd.isnull(rolling_min_series.iloc[i])
            continue

        if start_idx < 0:
            start_idx = 0

        # Because the row values are a range from 0 to 20, the rolling min will be the start index
        # and the rolling max will be the end index
        assert rolling_min_series.iloc[i] == start_idx
        assert rolling_max_series.iloc[i] == end_idx
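

# Illustrative sketch, not part of featuretools: the index arithmetic checked in
# test_apply_roll_with_offset_gap, written as a standalone helper. It assumes a
# uniformly spaced series whose values equal their positions, so the rolling min
# and max are simply the window bounds. The helper name is hypothetical.
def _sketch_expected_window_bounds(i, window_length_num, gap_num):
    # The window covers window_length_num rows and ends gap_num rows before row i.
    start_idx = i - gap_num - window_length_num + 1
    end_idx = i - gap_num
    # A negative end index means the whole window precedes the series (result is nan).
    if end_idx < 0:
        return None
    # Clip the start so a partially filled window begins at the first row.
    return max(start_idx, 0), end_idx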


@pytest.mark.parametrize(
    "min_periods",
    [1, 0, None],
)
def test_apply_roll_with_offset_gap_default_min_periods(min_periods, window_series):
    window_length = "5d"
    window_length_num = 5
    gap = "3d"
    gap_num = 3

    def count_wrapper(sub_s):
        return apply_roll_with_offset_gap(sub_s, gap, len, min_periods=min_periods)

    rolling_count_obj = roll_series_with_gap(
        window_series,
        window_length,
        gap=gap,
        min_periods=min_periods,
    )
    rolling_count_series = rolling_count_obj.apply(count_wrapper)

    # The gap produces leading windows that contain no elements; these should be nan
    # to differentiate them from windows that contain only null values
    num_empty_aggregates = rolling_count_series.isna().sum()
    num_partial_aggregates = len(
        (rolling_count_series.loc[rolling_count_series != 0]).loc[
            rolling_count_series < window_length_num
        ],
    )

    assert num_empty_aggregates == gap_num
    assert num_partial_aggregates == window_length_num - 1


@pytest.mark.parametrize(
    "min_periods",
    [2, 3, 4, 5],
)
def test_apply_roll_with_offset_gap_min_periods(min_periods, window_series):
    window_length = "5d"
    window_length_num = 5
    gap = "3d"
    gap_num = 3

    def count_wrapper(sub_s):
        return apply_roll_with_offset_gap(sub_s, gap, len, min_periods=min_periods)

    rolling_count_obj = roll_series_with_gap(
        window_series,
        window_length,
        gap=gap,
        min_periods=min_periods,
    )
    rolling_count_series = rolling_count_obj.apply(count_wrapper)

    # The gap produces leading windows that contain no elements; these should be nan
    # to differentiate them from windows that contain only null values
    num_empty_aggregates = rolling_count_series.isna().sum()
    num_partial_aggregates = len(
        (rolling_count_series.loc[rolling_count_series != 0]).loc[
            rolling_count_series < window_length_num
        ],
    )

    assert num_empty_aggregates == min_periods - 1 + gap_num
    assert num_partial_aggregates == window_length_num - min_periods
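

# Illustrative sketch, not part of featuretools: where the counts asserted in the
# two min_periods tests above come from. It assumes uniformly spaced rows and
# treats a min_periods of None or 0 as 1, matching what those tests expect.
# The helper name is hypothetical.
def _sketch_window_counts(n_rows, window_length_num, gap_num, min_periods):
    required = max(min_periods or 1, 1)
    counts = []
    for i in range(n_rows):
        # Rows available to the window ending gap_num rows before row i.
        n_obs = max(0, min(window_length_num, i - gap_num + 1))
        # Windows with fewer than the required observations yield nan (None here).
        counts.append(n_obs if n_obs >= required else None)
    return counts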


def test_apply_roll_with_offset_gap_non_uniform():
    window_length = "3d"
    gap = "3d"
    min_periods = 1
    # When the data isn't uniform, this impacts the number of values in each rolling window
    datetimes = (
        list(pd.date_range(start="2017-01-01", freq="1d", periods=7))
        + list(pd.date_range(start="2017-02-01", freq="2d", periods=7))
        + list(pd.date_range(start="2017-03-01", freq="1d", periods=7))
    )
    no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)

    assert pd.infer_freq(no_freq_series.index) is None

    expected_series = pd.Series(
        [None, None, None, 1, 2, 3, 3]
        + [None, None, 1, 1, 1, 1, 1]
        + [None, None, None, 1, 2, 3, 3],
        index=datetimes,
    )

    def count_wrapper(sub_s):
        return apply_roll_with_offset_gap(sub_s, gap, len, min_periods=min_periods)

    rolling_count_obj = roll_series_with_gap(
        no_freq_series,
        window_length,
        gap=gap,
        min_periods=min_periods,
    )
    rolling_count_series = rolling_count_obj.apply(count_wrapper)

    pd.testing.assert_series_equal(rolling_count_series, expected_series)
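

# Illustrative sketch, not part of featuretools: a standard-library version of the
# counting behind expected_series in the non-uniform test above. It assumes each
# window covers the half-open interval (t - gap - window_length, t - gap] and that
# an empty window yields nan (None here). The helper name is hypothetical.
from datetime import datetime, timedelta


def _sketch_offset_window_counts(timestamps, window_days, gap_days):
    counts = []
    for t in timestamps:
        upper = t - timedelta(days=gap_days)
        lower = upper - timedelta(days=window_days)
        # Count observations that fall inside the gapped window for this row.
        n_obs = sum(1 for other in timestamps if lower < other <= upper)
        counts.append(n_obs if n_obs > 0 else None)
    return counts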


def test_apply_roll_with_offset_data_frequency_higher_than_parameters_frequency():
    window_length = "5D"  # 120 hours
    window_length_num = 5
    # For min_periods to span the full window on hourly data, multiply 24 hours by 5 days
    min_periods = window_length_num * 24

    datetimes = list(pd.date_range(start="2017-01-01", freq="1H", periods=200))
    high_frequency_series = pd.Series(range(200), index=datetimes)

    # Check without gap
    gap = "0d"
    gap_num = 0

    def max_wrapper(sub_s):
        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods)

    rolling_max_obj = roll_series_with_gap(
        high_frequency_series,
        window_length,
        min_periods=min_periods,
        gap=gap,
    )
    rolling_max_series = rolling_max_obj.apply(max_wrapper)

    assert rolling_max_series.isna().sum() == (min_periods - 1) + gap_num

    # Check with small gap
    gap = "3H"
    gap_num = 3

    def max_wrapper(sub_s):
        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods)

    rolling_max_obj = roll_series_with_gap(
        high_frequency_series,
        window_length,
        min_periods=min_periods,
        gap=gap,
    )
    rolling_max_series = rolling_max_obj.apply(max_wrapper)

    assert rolling_max_series.isna().sum() == (min_periods - 1) + gap_num

    # Check with large gap - in terms of days, so we'll multiply by 24hours for number of nans
    gap = "2D"
    gap_num = 2

    def max_wrapper(sub_s):
        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods)

    rolling_max_obj = roll_series_with_gap(
        high_frequency_series,
        window_length,
        min_periods=min_periods,
        gap=gap,
    )
    rolling_max_series = rolling_max_obj.apply(max_wrapper)

    assert rolling_max_series.isna().sum() == (min_periods - 1) + (gap_num * 24)


def test_apply_roll_with_offset_data_min_periods_too_big(window_series):
    window_length = "5D"
    gap = "2d"

    # Since the data has a daily frequency, there will only be, at most, 5 rows in the window
    min_periods = 6

    def max_wrapper(sub_s):
        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods)

    rolling_max_obj = roll_series_with_gap(
        window_series,
        window_length,
        min_periods=min_periods,
        gap=gap,
    )
    rolling_max_series = rolling_max_obj.apply(max_wrapper)

    # The resulting series is comprised entirely of nans
    assert rolling_max_series.isna().sum() == len(window_series)


def test_roll_series_with_gap_different_input_types_same_result_uniform(
    window_series,
):
    # Offset inputs will only produce the same results as numeric inputs
    # when the data has a uniform frequency
    offset_gap = "2d"
    offset_window_length = "5d"
    int_gap = 2
    int_window_length = 5
    min_periods = 1

    # Rolling series with matching input types
    expected_rolling_numeric = roll_series_with_gap(
        window_series,
        window_length=int_window_length,
        gap=int_gap,
        min_periods=min_periods,
    ).max()

    def max_wrapper(sub_s):
        return apply_roll_with_offset_gap(
            sub_s,
            offset_gap,
            max,
            min_periods=min_periods,
        )

    rolling_max_obj = roll_series_with_gap(
        window_series,
        window_length=offset_window_length,
        gap=offset_gap,
        min_periods=min_periods,
    )
    expected_rolling_offset = rolling_max_obj.apply(max_wrapper)

    # Confirm that the numeric and offset results are equal to one another
    pd.testing.assert_series_equal(expected_rolling_numeric, expected_rolling_offset)

    # Rolling series with mismatched input types
    mismatched_numeric_gap = roll_series_with_gap(
        window_series,
        window_length=offset_window_length,
        gap=int_gap,
        min_periods=min_periods,
    ).max()
    # Confirm the mismatched input types also produce the same results
    pd.testing.assert_series_equal(expected_rolling_numeric, mismatched_numeric_gap)


def test_roll_series_with_gap_incorrect_types(window_series):
    error = "Window length must be either an offset string or an integer."
    with pytest.raises(TypeError, match=error):
        roll_series_with_gap(
            window_series,
            window_length=4.2,
            gap=4,
            min_periods=1,
        )

    error = "Gap must be either an offset string or an integer."
    with pytest.raises(TypeError, match=error):
        roll_series_with_gap(window_series, window_length=4, gap=4.2, min_periods=1)


def test_roll_series_with_gap_negative_inputs(window_series):
    error = "Window length must be greater than zero."
    with pytest.raises(ValueError, match=error):
        roll_series_with_gap(window_series, window_length=-4, gap=4, min_periods=1)

    error = "Gap must be greater than or equal to zero."
    with pytest.raises(ValueError, match=error):
        roll_series_with_gap(window_series, window_length=4, gap=-4, min_periods=1)


def test_roll_series_with_non_offset_string_inputs(window_series):
    error = "Cannot roll series. The specified gap, test, is not a valid offset alias."
    with pytest.raises(ValueError, match=error):
        roll_series_with_gap(
            window_series,
            window_length="4D",
            gap="test",
            min_periods=1,
        )

    error = "Cannot roll series. The specified window length, test, is not a valid offset alias."
    with pytest.raises(ValueError, match=error):
        roll_series_with_gap(
            window_series,
            window_length="test",
            gap="7D",
            min_periods=1,
        )

    # Test mismatched types error
    error = (
        "Cannot roll series with offset gap, 2d, and numeric window length, 7. "
        "If an offset alias is used for gap, the window length must also be defined as an offset alias. "
        "Please either change gap to be numeric or change window length to be an offset alias."
    )
    with pytest.raises(TypeError, match=error):
        roll_series_with_gap(
            window_series,
            window_length=7,
            gap="2d",
            min_periods=1,
        ).max()


@pytest.mark.parametrize(
    "primitive",
    [RollingCount, RollingMax, RollingMin, RollingMean, RollingSTD, RollingTrend],
)
@patch(
    "featuretools.primitives.standard.transform.time_series.utils.apply_roll_with_offset_gap",
)
def test_no_call_to_apply_roll_with_offset_gap_with_numeric(
    mock_apply_roll,
    primitive,
    window_series,
):
    assert not mock_apply_roll.called

    fully_numeric_primitive = primitive(window_length=3, gap=1)
    primitive_func = fully_numeric_primitive.get_function()
    if isinstance(fully_numeric_primitive, RollingCount):
        pd.Series(primitive_func(window_series.index))
    else:
        pd.Series(
            primitive_func(
                window_series.index,
                pd.Series(window_series.values),
            ),
        )

    assert not mock_apply_roll.called

    offset_window_primitive = primitive(window_length="3d", gap=1)
    primitive_func = offset_window_primitive.get_function()
    if isinstance(offset_window_primitive, RollingCount):
        pd.Series(primitive_func(window_series.index))
    else:
        pd.Series(
            primitive_func(
                window_series.index,
                pd.Series(window_series.values),
            ),
        )

    assert not mock_apply_roll.called

    no_gap_specified_primitive = primitive(window_length="3d")
    primitive_func = no_gap_specified_primitive.get_function()
    if isinstance(no_gap_specified_primitive, RollingCount):
        pd.Series(primitive_func(window_series.index))
    else:
        pd.Series(
            primitive_func(
                window_series.index,
                pd.Series(window_series.values),
            ),
        )

    assert not mock_apply_roll.called

    offset_gap_primitive = primitive(window_length="3d", gap="1d")
    primitive_func = offset_gap_primitive.get_function()
    if isinstance(offset_gap_primitive, RollingCount):
        pd.Series(primitive_func(window_series.index))
    else:
        pd.Series(
            primitive_func(
                window_series.index,
                pd.Series(window_series.values),
            ),
        )

    assert mock_apply_roll.called


================================================
FILE: featuretools/tests/primitive_tests/test_transform_features.py
================================================
from inspect import isclass

import numpy as np
import pandas as pd
import pytest
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import (
    Boolean,
    BooleanNullable,
    Categorical,
    Datetime,
    Double,
    Integer,
    IntegerNullable,
)

from featuretools import (
    AggregationFeature,
    EntitySet,
    Feature,
    IdentityFeature,
    TransformFeature,
    calculate_feature_matrix,
    dfs,
    primitives,
)
from featuretools.computational_backends.feature_set import FeatureSet
from featuretools.computational_backends.feature_set_calculator import (
    FeatureSetCalculator,
)
from featuretools.primitives import (
    Absolute,
    AddNumeric,
    AddNumericScalar,
    Age,
    Count,
    Day,
    Diff,
    DiffDatetime,
    DivideByFeature,
    DivideNumeric,
    DivideNumericScalar,
    Equal,
    EqualScalar,
    FileExtension,
    First,
    FullNameToFirstName,
    FullNameToLastName,
    FullNameToTitle,
    GreaterThan,
    GreaterThanEqualTo,
    GreaterThanEqualToScalar,
    GreaterThanScalar,
    Haversine,
    Hour,
    IsIn,
    IsNull,
    Lag,
    Latitude,
    LessThan,
    LessThanEqualTo,
    LessThanEqualToScalar,
    LessThanScalar,
    Longitude,
    Mode,
    MultiplyBoolean,
    MultiplyNumeric,
    MultiplyNumericBoolean,
    MultiplyNumericScalar,
    Not,
    NotEqual,
    NotEqualScalar,
    NumCharacters,
    NumericLag,
    NumWords,
    Percentile,
    ScalarSubtractNumericFeature,
    SubtractNumeric,
    SubtractNumericScalar,
    Sum,
    TimeSince,
    TransformPrimitive,
    get_transform_primitives,
)
from featuretools.synthesis.deep_feature_synthesis import match


def test_init_and_name(es):
    log = es["log"]
    rating = Feature(IdentityFeature(es["products"].ww["rating"]), "log")
    log_features = [Feature(es["log"].ww[col]) for col in log.columns] + [
        Feature(rating, primitive=GreaterThanScalar(2.5)),
        Feature(rating, primitive=GreaterThanScalar(3.5)),
    ]
    # Add Timedelta feature
    # features.append(pd.Timestamp.now() - Feature(log['datetime']))
    customers_features = [
        Feature(es["customers"].ww[col]) for col in es["customers"].columns
    ]

    # check all transform primitives have a name
    for attribute_string in dir(primitives):
        attr = getattr(primitives, attribute_string)
        if isclass(attr):
            if issubclass(attr, TransformPrimitive) and attr != TransformPrimitive:
                assert getattr(attr, "name") is not None

    trans_primitives = get_transform_primitives().values()

    for transform_prim in trans_primitives:
        # skip automated testing for a few special cases
        features_to_use = log_features
        if transform_prim in [NotEqual, Equal, FileExtension]:
            continue
        if transform_prim in [
            Age,
            FullNameToFirstName,
            FullNameToLastName,
            FullNameToTitle,
        ]:
            features_to_use = customers_features

        # use the input_types matching function from DFS
        input_types = transform_prim.input_types
        if isinstance(input_types[0], list):
            matching_inputs = match(input_types[0], features_to_use)
        else:
            matching_inputs = match(input_types, features_to_use)
        if len(matching_inputs) == 0:
            raise Exception("Transform Primitive %s not tested" % transform_prim.name)
        for prim in matching_inputs:
            instance = Feature(prim, primitive=transform_prim)

            # try to get name and calculate
            instance.get_name()
            calculate_feature_matrix([instance], entityset=es)


def test_relationship_path(es):
    f = TransformFeature(Feature(es["log"].ww["datetime"]), Hour)

    assert len(f.relationship_path) == 0


def test_serialization(es):
    value = IdentityFeature(es["log"].ww["value"])
    primitive = MultiplyNumericScalar(value=2)
    value_x2 = TransformFeature(value, primitive)

    dictionary = {
        "name": value_x2.get_name(),
        "base_features": [value.unique_name()],
        "primitive": primitive,
    }

    assert dictionary == value_x2.get_arguments()
    assert value_x2 == TransformFeature.from_dictionary(
        dictionary,
        es,
        {value.unique_name(): value},
        primitive,
    )


def test_make_trans_feat(es):
    f = Feature(es["log"].ww["datetime"], primitive=Hour)

    feature_set = FeatureSet([f])
    calculator = FeatureSetCalculator(es, feature_set=feature_set)
    df = calculator.run(np.array([0]))
    v = df[f.get_name()][0]
    assert v == 10


@pytest.fixture
def simple_es():
    df = pd.DataFrame(
        {
            "id": range(4),
            "value": pd.Categorical(["a", "c", "b", "d"]),
            "value2": pd.Categorical(["a", "b", "a", "d"]),
            "object": ["time1", "time2", "time3", "time4"],
            "datetime": pd.Series(
                [
                    pd.Timestamp("2001-01-01"),
                    pd.Timestamp("2001-01-02"),
                    pd.Timestamp("2001-01-03"),
                    pd.Timestamp("2001-01-04"),
                ],
            ),
        },
    )

    es = EntitySet("equal_test")
    es.add_dataframe(dataframe_name="values", dataframe=df, index="id")

    return es


def test_equal_categorical(simple_es):
    f1 = Feature(
        [
            IdentityFeature(simple_es["values"].ww["value"]),
            IdentityFeature(simple_es["values"].ww["value2"]),
        ],
        primitive=Equal,
    )

    df = calculate_feature_matrix(entityset=simple_es, features=[f1])
    assert set(simple_es["values"]["value"].cat.categories) != set(
        simple_es["values"]["value2"].cat.categories,
    )
    assert df["value = value2"].to_list() == [
        True,
        False,
        False,
        True,
    ]


def test_equal_different_dtypes(simple_es):
    f1 = Feature(
        [
            IdentityFeature(simple_es["values"].ww["object"]),
            IdentityFeature(simple_es["values"].ww["datetime"]),
        ],
        primitive=Equal,
    )
    f2 = Feature(
        [
            IdentityFeature(simple_es["values"].ww["datetime"]),
            IdentityFeature(simple_es["values"].ww["object"]),
        ],
        primitive=Equal,
    )

    # verify that equals works for different dtypes regardless of order
    df = calculate_feature_matrix(entityset=simple_es, features=[f1, f2])

    assert df["object = datetime"].to_list() == [False, False, False, False]
    assert df["datetime = object"].to_list() == [False, False, False, False]


def test_not_equal_categorical(simple_es):
    f1 = Feature(
        [
            IdentityFeature(simple_es["values"].ww["value"]),
            IdentityFeature(simple_es["values"].ww["value2"]),
        ],
        primitive=NotEqual,
    )

    df = calculate_feature_matrix(entityset=simple_es, features=[f1])

    assert set(simple_es["values"]["value"].cat.categories) != set(
        simple_es["values"]["value2"].cat.categories,
    )
    assert df["value != value2"].to_list() == [
        False,
        True,
        True,
        False,
    ]


def test_not_equal_different_dtypes(simple_es):
    f1 = Feature(
        [
            IdentityFeature(simple_es["values"].ww["object"]),
            IdentityFeature(simple_es["values"].ww["datetime"]),
        ],
        primitive=NotEqual,
    )
    f2 = Feature(
        [
            IdentityFeature(simple_es["values"].ww["datetime"]),
            IdentityFeature(simple_es["values"].ww["object"]),
        ],
        primitive=NotEqual,
    )

    # verify that not equal works for different dtypes regardless of order
    df = calculate_feature_matrix(entityset=simple_es, features=[f1, f2])

    assert df["object != datetime"].to_list() == [True, True, True, True]
    assert df["datetime != object"].to_list() == [True, True, True, True]


def test_diff(es):
    value = Feature(es["log"].ww["value"])
    customer_id_feat = Feature(es["sessions"].ww["customer_id"], "log")
    diff1 = Feature(
        value,
        groupby=Feature(es["log"].ww["session_id"]),
        primitive=Diff,
    )
    diff2 = Feature(value, groupby=customer_id_feat, primitive=Diff)

    feature_set = FeatureSet([diff1, diff2])
    calculator = FeatureSetCalculator(es, feature_set=feature_set)
    df = calculator.run(np.array(range(15)))

    val1 = df[diff1.get_name()].tolist()
    val2 = df[diff2.get_name()].tolist()

    correct_vals1 = [
        np.nan,
        5,
        5,
        5,
        5,
        np.nan,
        1,
        1,
        1,
        np.nan,
        np.nan,
        5,
        np.nan,
        7,
        7,
    ]
    correct_vals2 = [np.nan, 5, 5, 5, 5, -20, 1, 1, 1, -3, np.nan, 5, -5, 7, 7]
    np.testing.assert_equal(val1, correct_vals1)
    np.testing.assert_equal(val2, correct_vals2)


def test_diff_shift(es):
    value = Feature(es["log"].ww["value"])
    customer_id_feat = Feature(es["sessions"].ww["customer_id"], "log")
    diff_periods = Feature(value, groupby=customer_id_feat, primitive=Diff(periods=1))

    feature_set = FeatureSet([diff_periods])
    calculator = FeatureSetCalculator(es, feature_set=feature_set)
    df = calculator.run(np.array(range(15)))
    val3 = df[diff_periods.get_name()].tolist()

    correct_vals3 = [np.nan, np.nan, 5, 5, 5, 5, -20, 1, 1, 1, np.nan, np.nan, 5, -5, 7]
    np.testing.assert_equal(val3, correct_vals3)


def test_diff_single_value(es):
    diff = Feature(
        es["stores"].ww["num_square_feet"],
        groupby=Feature(es["stores"].ww["région_id"]),
        primitive=Diff,
    )
    feature_set = FeatureSet([diff])
    calculator = FeatureSetCalculator(es, feature_set=feature_set)
    df = calculator.run(np.array([4]))
    assert df[diff.get_name()][4] == 6000.0


def test_diff_reordered(es):
    sum_feat = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        primitive=Sum,
    )
    diff = Feature(sum_feat, primitive=Diff)
    feature_set = FeatureSet([diff])
    calculator = FeatureSetCalculator(es, feature_set=feature_set)
    df = calculator.run(np.array([4, 2]))
    assert df[diff.get_name()][4] == 16
    assert df[diff.get_name()][2] == -6


def test_diff_single_value_is_nan(es):
    diff = Feature(
        es["stores"].ww["num_square_feet"],
        groupby=Feature(es["stores"].ww["région_id"]),
        primitive=Diff,
    )
    feature_set = FeatureSet([diff])
    calculator = FeatureSetCalculator(es, feature_set=feature_set)
    df = calculator.run(np.array([5]))
    assert df.shape[0] == 1
    assert df[diff.get_name()].dropna().shape[0] == 0


def test_diff_datetime(es):
    diff = Feature(
        es["log"].ww["datetime"],
        primitive=DiffDatetime,
    )
    feature_set = FeatureSet([diff])
    calculator = FeatureSetCalculator(es, feature_set=feature_set)
    df = calculator.run(np.array(range(15)))
    vals = pd.Series(df[diff.get_name()].tolist())
    expected_vals = pd.Series(
        [
            pd.NaT,
            pd.Timedelta(seconds=6),
            pd.Timedelta(seconds=6),
            pd.Timedelta(seconds=6),
            pd.Timedelta(seconds=6),
            pd.Timedelta(seconds=36),
            pd.Timedelta(seconds=9),
            pd.Timedelta(seconds=9),
            pd.Timedelta(seconds=9),
            pd.Timedelta(minutes=8, seconds=33),
            pd.Timedelta(days=1),
            pd.Timedelta(seconds=1),
            pd.Timedelta(seconds=59),
            pd.Timedelta(seconds=3),
            pd.Timedelta(seconds=3),
        ],
    )
    pd.testing.assert_series_equal(vals, expected_vals)


def test_diff_datetime_shift(es):
    diff = Feature(
        es["log"].ww["datetime"],
        primitive=DiffDatetime(periods=1),
    )
    feature_set = FeatureSet([diff])
    calculator = FeatureSetCalculator(es, feature_set=feature_set)
    df = calculator.run(np.array(range(6)))
    vals = pd.Series(df[diff.get_name()].tolist())
    expected_vals = pd.Series(
        [
            pd.NaT,
            pd.NaT,
            pd.Timedelta(seconds=6),
            pd.Timedelta(seconds=6),
            pd.Timedelta(seconds=6),
            pd.Timedelta(seconds=6),
        ],
    )
    pd.testing.assert_series_equal(vals, expected_vals)


def test_compare_of_identity(es):
    to_test = [
        (EqualScalar, [False, False, True, False]),
        (NotEqualScalar, [True, True, False, True]),
        (LessThanScalar, [True, True, False, False]),
        (LessThanEqualToScalar, [True, True, True, False]),
        (GreaterThanScalar, [False, False, False, True]),
        (GreaterThanEqualToScalar, [False, False, True, True]),
    ]

    features = []
    for test in to_test:
        features.append(Feature(es["log"].ww["value"], primitive=test[0](10)))

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=[0, 1, 2, 3],
    )

    for i, test in enumerate(to_test):
        v = df[features[i].get_name()].tolist()
        assert v == test[1]


def test_compare_of_direct(es):
    log_rating = Feature(es["products"].ww["rating"], "log")
    to_test = [
        (EqualScalar, [False, False, False, False]),
        (NotEqualScalar, [True, True, True, True]),
        (LessThanScalar, [False, False, False, True]),
        (LessThanEqualToScalar, [False, False, False, True]),
        (GreaterThanScalar, [True, True, True, False]),
        (GreaterThanEqualToScalar, [True, True, True, False]),
    ]

    features = []
    for test in to_test:
        features.append(Feature(log_rating, primitive=test[0](4.5)))

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=[0, 1, 2, 3],
    )

    for i, test in enumerate(to_test):
        v = df[features[i].get_name()].tolist()
        assert v == test[1]


def test_compare_of_transform(es):
    day = Feature(es["log"].ww["datetime"], primitive=Day)
    to_test = [
        (EqualScalar, [False, True]),
        (NotEqualScalar, [True, False]),
    ]

    features = []
    for test in to_test:
        features.append(Feature(day, primitive=test[0](10)))

    df = calculate_feature_matrix(entityset=es, features=features, instance_ids=[0, 14])

    for i, test in enumerate(to_test):
        v = df[features[i].get_name()].tolist()
        assert v == test[1]


def test_compare_of_agg(es):
    count_logs = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )

    to_test = [
        (EqualScalar, [False, False, False, True]),
        (NotEqualScalar, [True, True, True, False]),
        (LessThanScalar, [False, False, True, False]),
        (LessThanEqualToScalar, [False, False, True, True]),
        (GreaterThanScalar, [True, True, False, False]),
        (GreaterThanEqualToScalar, [True, True, False, True]),
    ]

    features = []
    for test in to_test:
        features.append(Feature(count_logs, primitive=test[0](2)))

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=[0, 1, 2, 3],
    )

    for i, test in enumerate(to_test):
        v = df[features[i].get_name()].tolist()
        assert v == test[1]


def test_compare_all_nans(es):
    nan_feat = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="sessions",
        primitive=Mode,
    )
    compare = nan_feat == "brown bag"

    # before all data
    time_last = pd.Timestamp("1/1/1993")

    df = calculate_feature_matrix(
        entityset=es,
        features=[nan_feat, compare],
        instance_ids=[0, 1, 2],
        cutoff_time=time_last,
    )

    assert df[nan_feat.get_name()].dropna().shape[0] == 0
    assert not df[compare.get_name()].any()


def test_arithmetic_of_val(es):
    to_test = [
        (AddNumericScalar, [2.0, 7.0, 12.0, 17.0]),
        (SubtractNumericScalar, [-2.0, 3.0, 8.0, 13.0]),
        (ScalarSubtractNumericFeature, [2.0, -3.0, -8.0, -13.0]),
        (MultiplyNumericScalar, [0, 10, 20, 30]),
        (DivideNumericScalar, [0, 2.5, 5, 7.5]),
        (DivideByFeature, [np.inf, 0.4, 0.2, 2 / 15.0]),
    ]

    features = []
    for test in to_test:
        features.append(Feature(es["log"].ww["value"], primitive=test[0](2)))

    features.append(Feature(es["log"].ww["value"]) / 0)

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=[0, 1, 2, 3],
    )

    for f, test in zip(features, to_test):
        v = df[f.get_name()].tolist()
        assert v == test[1]

    test = [np.nan, np.inf, np.inf, np.inf]
    v = df[features[-1].get_name()].tolist()
    assert np.isnan(v[0])
    assert v[1:] == test[1:]


def test_arithmetic_two_vals_fails(es):
    error_text = "Not a feature"
    with pytest.raises(Exception, match=error_text):
        Feature([2, 2], primitive=AddNumeric)


def test_arithmetic_of_identity(es):
    to_test = [
        (AddNumeric, [0.0, 7.0, 14.0, 21.0]),
        (SubtractNumeric, [0, 3, 6, 9]),
        (MultiplyNumeric, [0, 10, 40, 90]),
        (DivideNumeric, [np.nan, 2.5, 2.5, 2.5]),
    ]

    features = []
    for test in to_test:
        features.append(
            Feature(
                [
                    Feature(es["log"].ww["value"]),
                    Feature(es["log"].ww["value_2"]),
                ],
                primitive=test[0],
            ),
        )

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=[0, 1, 2, 3],
    )

    for i, test in enumerate(to_test[:-1]):
        v = df[features[i].get_name()].tolist()
        assert v == test[1]
    i, test = -1, to_test[-1]
    v = df[features[i].get_name()].tolist()
    assert np.isnan(v[0])
    assert v[1:] == test[1][1:]


def test_arithmetic_of_direct(es):
    rating = Feature(es["products"].ww["rating"])
    log_rating = Feature(rating, "log")
    customer_age = Feature(es["customers"].ww["age"])
    session_age = Feature(customer_age, "sessions")
    log_age = Feature(session_age, "log")

    to_test = [
        (AddNumeric, [38, 37, 37.5, 37.5]),
        (SubtractNumeric, [28, 29, 28.5, 28.5]),
        (MultiplyNumeric, [165, 132, 148.5, 148.5]),
        (DivideNumeric, [6.6, 8.25, 22.0 / 3, 22.0 / 3]),
    ]

    features = []
    for test in to_test:
        features.append(Feature([log_age, log_rating], primitive=test[0]))

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=[0, 3, 5, 7],
    )

    for i, test in enumerate(to_test):
        v = df[features[i].get_name()].tolist()
        assert v == test[1]


@pytest.fixture
def boolean_mult_es():
    es = EntitySet()
    df = pd.DataFrame(
        {
            "index": [0, 1, 2],
            "bool": pd.Series([True, False, True]),
            "numeric": [2, 3, np.nan],
        },
    )

    es.add_dataframe(
        dataframe_name="test",
        dataframe=df,
        index="index",
        logical_types={"numeric": Double},
    )

    return es


def test_boolean_multiply(boolean_mult_es):
    es = boolean_mult_es
    to_test = [
        ("numeric", "numeric"),
        ("numeric", "bool"),
        ("bool", "numeric"),
        ("bool", "bool"),
    ]
    features = []
    for row in to_test:
        features.append(Feature(es["test"].ww[row[0]]) * Feature(es["test"].ww[row[1]]))

    fm = calculate_feature_matrix(entityset=es, features=features)

    df = es["test"]

    for row in to_test:
        col_name = "{} * {}".format(row[0], row[1])
        if row[0] == "bool" and row[1] == "bool":
            # Boolean * Boolean is expected to match a logical AND, returned
            # with the nullable "boolean" dtype.
            assert fm[col_name].equals((df[row[0]] & df[row[1]]).astype("boolean"))
        else:
            assert fm[col_name].equals(df[row[0]] * df[row[1]])


def test_arithmetic_of_transform(es):
    diff1 = Feature([Feature(es["log"].ww["value"])], primitive=Diff)
    diff2 = Feature([Feature(es["log"].ww["value_2"])], primitive=Diff)

    to_test = [
        (AddNumeric, [np.nan, 7.0, -7.0, 10.0]),
        (SubtractNumeric, [np.nan, 3.0, -3.0, 4.0]),
        (MultiplyNumeric, [np.nan, 10.0, 10.0, 21.0]),
        (DivideNumeric, [np.nan, 2.5, 2.5, 2.3333333333333335]),
    ]

    features = []
    for test in to_test:
        features.append(Feature([diff1, diff2], primitive=test[0]()))

    feature_set = FeatureSet(features)
    calculator = FeatureSetCalculator(es, feature_set=feature_set)
    df = calculator.run(np.array([0, 2, 12, 13]))
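    for i, test in enumerate(to_test):
        v = df[features[i].get_name()].tolist()
        # The first expected value is NaN (Diff has no previous row to compare
        # against). NaN != NaN, so check it separately from the remaining values,
        # without mutating the expected lists.
        assert np.isnan(v[0])
        assert np.isnan(test[1][0])
        assert v[1:] == test[1][1:]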


def test_not_feature(es):
    not_feat = Feature(es["customers"].ww["loves_ice_cream"], primitive=Not)
    features = [not_feat]
    df = calculate_feature_matrix(entityset=es, features=features, instance_ids=[0, 1])
    v = df[not_feat.get_name()].values
    assert not v[0]
    assert v[1]


def test_arithmetic_of_agg(es):
    customer_id_feat = Feature(es["customers"].ww["id"])
    store_id_feat = Feature(es["stores"].ww["id"])
    count_customer = Feature(
        customer_id_feat,
        parent_dataframe_name="régions",
        primitive=Count,
    )
    count_stores = Feature(
        store_id_feat,
        parent_dataframe_name="régions",
        primitive=Count,
    )
    to_test = [
        (AddNumeric, [6, 2]),
        (SubtractNumeric, [0, -2]),
        (MultiplyNumeric, [9, 0]),
        (DivideNumeric, [1, 0]),
    ]

    features = []
    for test in to_test:
        features.append(Feature([count_customer, count_stores], primitive=test[0]()))

    ids = ["United States", "Mexico"]
    df = calculate_feature_matrix(entityset=es, features=features, instance_ids=ids)
    df = df.loc[ids]

    for i, test in enumerate(to_test):
        v = df[features[i].get_name()].tolist()
        assert v == test[1]


def test_latlong(es):
    log_latlong_feat = Feature(es["log"].ww["latlong"])
    latitude = Feature(log_latlong_feat, primitive=Latitude)
    longitude = Feature(log_latlong_feat, primitive=Longitude)
    features = [latitude, longitude]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )
    latvalues = df[latitude.get_name()].values
    lonvalues = df[longitude.get_name()].values
    assert len(latvalues) == 15
    assert len(lonvalues) == 15
    real_lats = [0, 5, 10, 15, 20, 0, 1, 2, 3, 0, 0, 5, 0, 7, 14]
    real_lons = [0, 2, 4, 6, 8, 0, 1, 2, 3, 0, 0, 2, 0, 3, 6]
    for i, v in enumerate(real_lats):
        assert v == latvalues[i]
    for i, v in enumerate(real_lons):
        assert v == lonvalues[i]


def test_latlong_with_nan(es):
    df = es["log"]
    df["latlong"][0] = np.nan
    df["latlong"][1] = (10, np.nan)
    df["latlong"][2] = (np.nan, 4)
    df["latlong"][3] = (np.nan, np.nan)
    es.replace_dataframe(dataframe_name="log", df=df)
    log_latlong_feat = Feature(es["log"].ww["latlong"])
    latitude = Feature(log_latlong_feat, primitive=Latitude)
    longitude = Feature(log_latlong_feat, primitive=Longitude)
    features = [latitude, longitude]
    fm = calculate_feature_matrix(entityset=es, features=features)
    latvalues = fm[latitude.get_name()].values
    lonvalues = fm[longitude.get_name()].values
    assert len(latvalues) == 17
    assert len(lonvalues) == 17
    real_lats = [
        np.nan,
        10,
        np.nan,
        np.nan,
        20,
        0,
        1,
        2,
        3,
        0,
        0,
        5,
        0,
        7,
        14,
        np.nan,
        np.nan,
    ]
    real_lons = [
        np.nan,
        np.nan,
        4,
        np.nan,
        8,
        0,
        1,
        2,
        3,
        0,
        0,
        2,
        0,
        3,
        6,
        np.nan,
        np.nan,
    ]
    assert np.allclose(latvalues, real_lats, atol=0.0001, equal_nan=True)
    assert np.allclose(lonvalues, real_lons, atol=0.0001, equal_nan=True)


def test_haversine(es):
    log_latlong_feat = Feature(es["log"].ww["latlong"])
    log_latlong_feat2 = Feature(es["log"].ww["latlong2"])
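    # No unit is passed here; the expected values below are in miles
    # (compare with the explicit kilometers case further down).
    haversine = Feature([log_latlong_feat, log_latlong_feat2], primitive=Haversine)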
    features = [haversine]

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )
    values = df[haversine.get_name()].values
    real = [
        0,
        525.318462,
        1045.32190304,
        1554.56176802,
        2047.3294327,
        0,
        138.16578931,
        276.20524822,
        413.99185444,
        0,
        0,
        525.318462,
        0,
        741.57941183,
        1467.52760175,
    ]
    assert len(values) == 15
    assert np.allclose(values, real, atol=0.0001)

    haversine = Feature(
        [log_latlong_feat, log_latlong_feat2],
        primitive=Haversine(unit="kilometers"),
    )
    features = [haversine]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )
    values = df[haversine.get_name()].values
    real_km = [
        0,
        845.41812212,
        1682.2825471,
        2501.82467535,
        3294.85736668,
        0,
        222.35628593,
        444.50926278,
        666.25531268,
        0,
        0,
        845.41812212,
        0,
        1193.45638714,
        2361.75676089,
    ]
    assert len(values) == 15
    assert np.allclose(values, real_km, atol=0.0001)
    error_text = "Invalid unit inches provided. Must be one of"
    with pytest.raises(ValueError, match=error_text):
        Haversine(unit="inches")


def test_haversine_with_nan(es):
    # Check some `nan` values
    df = es["log"]
    df["latlong"][0] = np.nan
    df["latlong"][1] = (10, np.nan)
    es.replace_dataframe(dataframe_name="log", df=df)
    log_latlong_feat = Feature(es["log"].ww["latlong"])
    log_latlong_feat2 = Feature(es["log"].ww["latlong2"])
    haversine = Feature([log_latlong_feat, log_latlong_feat2], primitive=Haversine)
    features = [haversine]

    df = calculate_feature_matrix(entityset=es, features=features)
    values = df[haversine.get_name()].values
    real = [
        np.nan,
        np.nan,
        1045.32190304,
        1554.56176802,
        2047.3294327,
        0,
        138.16578931,
        276.20524822,
        413.99185444,
        0,
        0,
        525.318462,
        0,
        741.57941183,
        1467.52760175,
        np.nan,
        np.nan,
    ]

    assert np.allclose(values, real, atol=0.0001, equal_nan=True)

    # Check all `nan` values
    df = es["log"]
    df["latlong2"] = np.nan
    es.replace_dataframe(dataframe_name="log", df=df)
    log_latlong_feat = Feature(es["log"].ww["latlong"])
    log_latlong_feat2 = Feature(es["log"].ww["latlong2"])
    haversine = Feature([log_latlong_feat, log_latlong_feat2], primitive=Haversine)
    features = [haversine]

    df = calculate_feature_matrix(entityset=es, features=features)
    values = df[haversine.get_name()].values
    real = [np.nan] * es["log"].shape[0]

    assert np.allclose(values, real, atol=0.0001, equal_nan=True)


def test_text_primitives(es):
    words = Feature(es["log"].ww["comments"], primitive=NumWords)
    chars = Feature(es["log"].ww["comments"], primitive=NumCharacters)

    features = [words, chars]

    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )

    word_counts = [532, 3, 3, 653, 1306, 1305, 174, 173, 79, 246, 1253, 3, 3, 3, 3]
    char_counts = [
        3392,
        10,
        10,
        4116,
        7961,
        7580,
        992,
        957,
        437,
        1325,
        6322,
        10,
        10,
        10,
        10,
    ]
    word_values = df[words.get_name()].values
    char_values = df[chars.get_name()].values
    assert len(word_values) == 15
    for i, v in enumerate(word_values):
        assert v == word_counts[i]
    for i, v in enumerate(char_values):
        assert v == char_counts[i]


def test_isin_feat(es):
    isin = Feature(
        es["log"].ww["product_id"],
        primitive=IsIn(list_of_outputs=["toothpaste", "coke zero"]),
    )
    features = [isin]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(8),
    )
    true = [True, True, True, False, False, True, True, True]
    v = df[isin.get_name()].tolist()
    assert true == v


def test_isin_feat_other_syntax(es):
    isin = Feature(es["log"].ww["product_id"]).isin(["toothpaste", "coke zero"])
    features = [isin]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(8),
    )
    true = [True, True, True, False, False, True, True, True]
    v = df[isin.get_name()].tolist()
    assert true == v


def test_isin_feat_other_syntax_int(es):
    isin = Feature(es["log"].ww["value"]).isin([5, 10])
    features = [isin]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(8),
    )
    true = [False, True, True, False, False, False, False, False]
    v = df[isin.get_name()].tolist()
    assert true == v


def test_isin_feat_custom(es):
    class CustomIsIn(TransformPrimitive):
        name = "is_in"
        input_types = [ColumnSchema()]
        return_type = ColumnSchema(logical_type=Boolean)

        def __init__(self, list_of_outputs=None):
            self.list_of_outputs = list_of_outputs

        def get_function(self):
            def pd_is_in(array):
                return array.isin(self.list_of_outputs)

            return pd_is_in

    isin = Feature(
        es["log"].ww["product_id"],
        primitive=CustomIsIn(list_of_outputs=["toothpaste", "coke zero"]),
    )
    features = [isin]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(8),
    )
    true = [True, True, True, False, False, True, True, True]
    v = df[isin.get_name()].tolist()
    assert true == v

    isin = Feature(es["log"].ww["product_id"]).isin(["toothpaste", "coke zero"])
    features = [isin]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(8),
    )
    true = [True, True, True, False, False, True, True, True]
    v = df[isin.get_name()].tolist()
    assert true == v

    isin = Feature(es["log"].ww["value"]).isin([5, 10])
    features = [isin]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(8),
    )
    true = [False, True, True, False, False, False, False, False]
    v = df[isin.get_name()].tolist()
    assert true == v


def test_isnull_feat(es):
    value = Feature(es["log"].ww["value"])
    diff = Feature(
        value,
        groupby=Feature(es["log"].ww["session_id"]),
        primitive=Diff,
    )
    isnull = Feature(diff, primitive=IsNull)
    features = [isnull]
    df = calculate_feature_matrix(
        entityset=es,
        features=features,
        instance_ids=range(15),
    )

    correct_vals = [
        True,
        False,
        False,
        False,
        False,
        True,
        False,
        False,
        False,
        True,
        True,
        False,
        True,
        False,
        False,
    ]
    values = df[isnull.get_name()].tolist()
    assert correct_vals == values


def test_percentile(es):
    v = Feature(es["log"].ww["value"])
    p = Feature(v, primitive=Percentile)
    feature_set = FeatureSet([p])
    calculator = FeatureSetCalculator(es, feature_set)
    df = calculator.run(np.array(range(10, 17)))
    true = es["log"][v.get_name()].rank(pct=True)
    true = true.loc[range(10, 17)]
    for t, a in zip(true.values, df[p.get_name()].values):
        assert (pd.isnull(t) and pd.isnull(a)) or t == a


def test_dependent_percentile(es):
    v = Feature(es["log"].ww["value"])
    p = Feature(v, primitive=Percentile)
    p2 = Feature(p - 1, primitive=Percentile)
    feature_set = FeatureSet([p, p2])
    calculator = FeatureSetCalculator(es, feature_set)
    df = calculator.run(np.array(range(10, 17)))
    true = es["log"][v.get_name()].rank(pct=True)
    true = true.loc[range(10, 17)]
    for t, a in zip(true.values, df[p.get_name()].values):
        assert (pd.isnull(t) and pd.isnull(a)) or t == a


def test_agg_percentile(es):
    v = Feature(es["log"].ww["value"])
    p = Feature(v, primitive=Percentile)
    agg = Feature(p, parent_dataframe_name="sessions", primitive=Sum)
    feature_set = FeatureSet([agg])
    calculator = FeatureSetCalculator(es, feature_set)
    df = calculator.run(np.array([0, 1]))
    log_vals = es["log"][[v.get_name(), "session_id"]]
    log_vals["percentile"] = log_vals[v.get_name()].rank(pct=True)
    true_p = log_vals.groupby("session_id")["percentile"].sum()[[0, 1]]
    for t, a in zip(true_p.values, df[agg.get_name()].values):
        assert (pd.isnull(t) and pd.isnull(a)) or t == a


def test_percentile_agg_percentile(es):
    v = Feature(es["log"].ww["value"])
    p = Feature(v, primitive=Percentile)
    agg = Feature(p, parent_dataframe_name="sessions", primitive=Sum)
    pagg = Feature(agg, primitive=Percentile)
    feature_set = FeatureSet([pagg])
    calculator = FeatureSetCalculator(es, feature_set)
    df = calculator.run(np.array([0, 1]))

    log_vals = es["log"][[v.get_name(), "session_id"]]
    log_vals["percentile"] = log_vals[v.get_name()].rank(pct=True)
    true_p = log_vals.groupby("session_id")["percentile"].sum().fillna(0)
    true_p = true_p.rank(pct=True)[[0, 1]]

    for t, a in zip(true_p.values, df[pagg.get_name()].values):
        assert (pd.isnull(t) and pd.isnull(a)) or t == a


def test_percentile_agg(es):
    v = Feature(es["log"].ww["value"])
    agg = Feature(v, parent_dataframe_name="sessions", primitive=Sum)
    pagg = Feature(agg, primitive=Percentile)
    feature_set = FeatureSet([pagg])
    calculator = FeatureSetCalculator(es, feature_set)
    df = calculator.run(np.array([0, 1]))

    log_vals = es["log"][[v.get_name(), "session_id"]]
    true_p = log_vals.groupby("session_id")[v.get_name()].sum().fillna(0)
    true_p = true_p.rank(pct=True)[[0, 1]]

    for t, a in zip(true_p.values, df[pagg.get_name()].values):
        assert (pd.isnull(t) and pd.isnull(a)) or t == a


def test_direct_percentile(es):
    v = Feature(es["customers"].ww["age"])
    p = Feature(v, primitive=Percentile)
    d = Feature(p, "sessions")
    feature_set = FeatureSet([d])
    calculator = FeatureSetCalculator(es, feature_set)
    df = calculator.run(np.array([0, 1]))

    cust_vals = es["customers"][[v.get_name()]]
    cust_vals["percentile"] = cust_vals[v.get_name()].rank(pct=True)
    true_p = cust_vals["percentile"].loc[[0, 0]]
    for t, a in zip(true_p.values, df[d.get_name()].values):
        assert (pd.isnull(t) and pd.isnull(a)) or t == a


def test_direct_agg_percentile(es):
    v = Feature(es["log"].ww["value"])
    p = Feature(v, primitive=Percentile)
    agg = Feature(p, parent_dataframe_name="customers", primitive=Sum)
    d = Feature(agg, "sessions")
    feature_set = FeatureSet([d])
    calculator = FeatureSetCalculator(es, feature_set)
    df = calculator.run(np.array([0, 1]))

    log_vals = es["log"][[v.get_name(), "session_id"]]
    log_vals["percentile"] = log_vals[v.get_name()].rank(pct=True)
    log_vals["customer_id"] = [0] * 10 + [1] * 5 + [2] * 2
    true_p = log_vals.groupby("customer_id")["percentile"].sum().fillna(0)
    true_p = true_p[[0, 0]]
    for t, a in zip(true_p.values, df[d.get_name()].values):
        assert (pd.isnull(t) and pd.isnull(a)) or round(t, 3) == round(a, 3)


def test_percentile_with_cutoff(es):
    v = Feature(es["log"].ww["value"])
    p = Feature(v, primitive=Percentile)
    feature_set = FeatureSet([p])
    calculator = FeatureSetCalculator(
        es,
        feature_set,
        pd.Timestamp("2011/04/09 10:30:13"),
    )
    df = calculator.run(np.array([2]))
    assert df[p.get_name()].tolist()[0] == 1.0


def test_two_kinds_of_dependents(es):
    v = Feature(es["log"].ww["value"])
    product = Feature(es["log"].ww["product_id"])
    agg = Feature(
        v,
        parent_dataframe_name="customers",
        where=product == "coke zero",
        primitive=Sum,
    )
    p = Feature(agg, primitive=Percentile)
    g = Feature(agg, primitive=Absolute)
    agg2 = Feature(
        v,
        parent_dataframe_name="sessions",
        where=product == "coke zero",
        primitive=Sum,
    )
    agg3 = Feature(agg2, parent_dataframe_name="customers", primitive=Sum)
    feature_set = FeatureSet([p, g, agg3])
    calculator = FeatureSetCalculator(es, feature_set)
    df = calculator.run(np.array([0, 1]))
    assert df[p.get_name()].tolist() == [2.0 / 3, 1.0]
    assert df[g.get_name()].tolist() == [15, 26]


def test_get_filepath(es):
    class Mod4(TransformPrimitive):
        """Return base feature modulo 4"""

        name = "mod4"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def get_function(self):
            filepath = self.get_filepath("featuretools_unit_test_example.csv")
            reference = pd.read_csv(filepath, header=None).squeeze("columns")

            def map_to_word(x):
                def _map(x):
                    if pd.isnull(x):
                        return x
                    return reference[int(x) % 4]

                return x.apply(_map)

            return map_to_word

    feat = Feature(es["log"].ww["value"], primitive=Mod4)
    df = calculate_feature_matrix(features=[feat], entityset=es, instance_ids=range(17))
    assert pd.isnull(df["MOD4(value)"][15])
    assert df["MOD4(value)"][0] == 0
    assert df["MOD4(value)"][14] == 2

    fm, fl = dfs(
        entityset=es,
        target_dataframe_name="log",
        agg_primitives=[],
        trans_primitives=[Mod4],
    )
    assert fm["MOD4(value)"][0] == 0
    assert fm["MOD4(value)"][14] == 2
    assert pd.isnull(fm["MOD4(value)"][15])


def test_override_multi_feature_names(es):
    def gen_custom_names(primitive, base_feature_names):
        return [
            "Above18(%s)" % base_feature_names,
            "Above21(%s)" % base_feature_names,
            "Above65(%s)" % base_feature_names,
        ]

    class IsGreater(TransformPrimitive):
        name = "is_greater"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 3

        def get_function(self):
            def is_greater(x):
                return x > 18, x > 21, x > 65

            return is_greater

        def generate_names(primitive, base_feature_names):
            return gen_custom_names(primitive, base_feature_names)

    fm, features = dfs(
        entityset=es,
        target_dataframe_name="customers",
        instance_ids=[0, 1, 2],
        agg_primitives=[],
        trans_primitives=[IsGreater],
    )

    expected_names = gen_custom_names(IsGreater, ["age"])

    for name in expected_names:
        assert name in fm.columns


def test_time_since_primitive_matches_all_datetime_types(es):
    fm, fl = dfs(
        target_dataframe_name="customers",
        entityset=es,
        trans_primitives=[TimeSince],
        agg_primitives=[],
        max_depth=1,
    )

    customers_datetime_cols = [
        col
        for col, t in es["customers"].ww.logical_types.items()
        if isinstance(t, Datetime)
    ]
    expected_names = [f"TIME_SINCE({v})" for v in customers_datetime_cols]

    for name in expected_names:
        assert name in fm.columns


def test_cfm_with_numeric_lag_and_non_nullable_column(es):
    # Fill NaNs so we can use a non-nullable numeric logical type in the EntitySet
    new_log = es["log"].copy()
    new_log["value"] = new_log["value"].fillna(0)
    new_log.ww.init(
        logical_types={"value": "Integer", "product_id": "Categorical"},
        index="id",
        time_index="datetime",
        name="new_log",
    )
    es.add_dataframe(new_log)
    rels = [
        ("sessions", "id", "new_log", "session_id"),
        ("products", "id", "new_log", "product_id"),
    ]
    es = es.add_relationships(rels)

    assert isinstance(es["new_log"].ww.logical_types["value"], Integer)

    periods = 5
    lag_primitive = NumericLag(periods=periods)
    cutoff_times = es["new_log"][["id", "datetime"]]
    fm, _ = dfs(
        target_dataframe_name="new_log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[lag_primitive],
        cutoff_time=cutoff_times,
    )
    assert fm["NUMERIC_LAG(datetime, value, periods=5)"].head(periods).isnull().all()
    assert fm["NUMERIC_LAG(datetime, value, periods=5)"].isnull().sum() == periods

    assert "NUMERIC_LAG(datetime, value_2, periods=5)" in fm.columns

    assert "NUMERIC_LAG(datetime, products.rating, periods=5)" in fm.columns
    assert (
        fm["NUMERIC_LAG(datetime, products.rating, periods=5)"]
        .head(periods)
        .isnull()
        .all()
    )


def test_cfm_with_lag_and_non_nullable_columns(es):
    # Fill NaNs so we can use a non-nullable numeric logical type in the EntitySet
    new_log = es["log"].copy()
    new_log["value"] = new_log["value"].fillna(0)
    new_log["value_double"] = new_log["value"]
    new_log["purchased_with_nulls"] = new_log["purchased"]
    new_log["purchased_with_nulls"][0:4] = None
    new_log.ww.init(
        logical_types={
            "value": "Integer",
            "value_2": "IntegerNullable",
            "product_id": "Categorical",
            "value_double": "Double",
            "purchased_with_nulls": "BooleanNullable",
        },
        index="id",
        time_index="datetime",
        name="new_log",
    )
    es.add_dataframe(new_log)
    rels = [
        ("sessions", "id", "new_log", "session_id"),
        ("products", "id", "new_log", "product_id"),
    ]
    es = es.add_relationships(rels)

    assert isinstance(es["new_log"].ww.logical_types["value"], Integer)

    periods = 5
    lag_primitive = Lag(periods=periods)
    cutoff_times = es["new_log"][["id", "datetime"]]
    fm, _ = dfs(
        target_dataframe_name="new_log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[lag_primitive],
        cutoff_time=cutoff_times,
    )
    # Integer
    assert fm["LAG(value, datetime, periods=5)"].head(periods).isnull().all()
    assert fm["LAG(value, datetime, periods=5)"].isnull().sum() == periods
    assert isinstance(
        fm.ww.schema.logical_types["LAG(value, datetime, periods=5)"],
        IntegerNullable,
    )

    # IntegerNullable
    assert "LAG(value_2, datetime, periods=5)" in fm.columns
    assert fm["LAG(value_2, datetime, periods=5)"].head(periods).isnull().all()
    assert isinstance(
        fm.ww.schema.logical_types["LAG(value_2, datetime, periods=5)"],
        IntegerNullable,
    )

    # Categorical
    assert "LAG(product_id, datetime, periods=5)" in fm.columns
    assert fm["LAG(product_id, datetime, periods=5)"].head(periods).isnull().all()
    assert isinstance(
        fm.ww.schema.logical_types["LAG(product_id, datetime, periods=5)"],
        Categorical,
    )

    # Double
    assert "LAG(value_double, datetime, periods=5)" in fm.columns
    assert fm["LAG(value_double, datetime, periods=5)"].head(periods).isnull().all()
    assert isinstance(
        fm.ww.schema.logical_types["LAG(value_double, datetime, periods=5)"],
        Double,
    )

    # Boolean
    assert "LAG(purchased, datetime, periods=5)" in fm.columns
    assert fm["LAG(purchased, datetime, periods=5)"].head(periods).isnull().all()
    assert isinstance(
        fm.ww.schema.logical_types["LAG(purchased, datetime, periods=5)"],
        BooleanNullable,
    )

    # BooleanNullable
    assert "LAG(purchased_with_nulls, datetime, periods=5)" in fm.columns
    assert (
        fm["LAG(purchased_with_nulls, datetime, periods=5)"]
        .head(periods)
        .isnull()
        .all()
    )
    assert isinstance(
        fm.ww.schema.logical_types["LAG(purchased_with_nulls, datetime, periods=5)"],
        BooleanNullable,
    )


def test_comparisons_with_ordinal_valid_inputs_that_dont_work_but_should(es):
    # TODO: Remove this test once the correct behavior is implemented in CFM
    # The following test covers a scenario where an intermediate feature doesn't have the correct type
    # because Woodwork has not yet been initialized. This calculation should work and return valid True/False
    # values. This should be fixed in a future PR, but until a fix is implemented null values are returned to
    # prevent calculate_feature_matrix from raising an Error when calculating features generated by DFS.

    priority_level = Feature(es["log"].ww["priority_level"])
    first_priority = AggregationFeature(
        priority_level,
        parent_dataframe_name="customers",
        primitive=First,
    )
    engagement = Feature(es["customers"].ww["engagement_level"])
    invalid_but_should_be_valid = [
        TransformFeature([engagement, first_priority], primitive=LessThan),
        TransformFeature([engagement, first_priority], primitive=LessThanEqualTo),
        TransformFeature([engagement, first_priority], primitive=GreaterThan),
        TransformFeature([engagement, first_priority], primitive=GreaterThanEqualTo),
    ]
    fm = calculate_feature_matrix(
        entityset=es,
        features=invalid_but_should_be_valid,
    )

    feature_cols = [f.get_name() for f in invalid_but_should_be_valid]
    for col in feature_cols:
        assert fm[col].isnull().all()


def test_multiply_numeric_boolean():
    test_cases = [
        {"val": 100, "mask": True, "expected": 100},
        {"val": 100, "mask": False, "expected": 0},
        {"val": 0, "mask": False, "expected": 0},
        {"val": 100, "mask": pd.NA, "expected": pd.NA},
        {"val": pd.NA, "mask": pd.NA, "expected": pd.NA},
        {"val": pd.NA, "mask": True, "expected": pd.NA},
        {"val": pd.NA, "mask": False, "expected": pd.NA},
    ]

    multiply_numeric_boolean = MultiplyNumericBoolean()
    # `case` avoids shadowing the built-in `input`
    for case in test_cases:
        vals = pd.Series(case["val"]).astype("Int64")
        mask = pd.Series(case["mask"])
        actual = multiply_numeric_boolean(vals, mask).tolist()[0]
        expected = case["expected"]
        if pd.isnull(expected):
            assert pd.isnull(actual)
        else:
            assert actual == expected


def test_multiply_numeric_boolean_multiple_dtypes_no_nulls():
    # Test without null values
    vals = pd.Series([1, 2, 3])
    bools = pd.Series([True, False, True])
    multiply_numeric_boolean = MultiplyNumericBoolean()
    numeric_dtypes = ["float64", "int64", "Int64"]
    boolean_dtypes = ["bool", "boolean"]

    for numeric_dtype in numeric_dtypes:
        for boolean_dtype in boolean_dtypes:
            actual = multiply_numeric_boolean(
                vals.astype(numeric_dtype),
                bools.astype(boolean_dtype),
            )
            expected = pd.Series([1, 0, 3])
            pd.testing.assert_series_equal(actual, expected, check_dtype=False)


def test_multiply_numeric_boolean_multiple_dtypes_with_nulls():
    # Test with null values
    vals = pd.Series([np.nan, 2, 3])
    bools = pd.Series([True, False, pd.NA], dtype="boolean")
    multiply_numeric_boolean = MultiplyNumericBoolean()
    numeric_dtypes = ["float64", "Int64"]

    for numeric_dtype in numeric_dtypes:
        actual = multiply_numeric_boolean(vals.astype(numeric_dtype), bools)
        expected = pd.Series([np.nan, 0, np.nan])
        pd.testing.assert_series_equal(actual, expected, check_dtype=False)


def test_feature_multiplication(es):
    numeric_ft = Feature(es["customers"].ww["age"])
    boolean_ft = Feature(es["customers"].ww["loves_ice_cream"])

    mult_numeric = numeric_ft * numeric_ft
    mult_boolean = boolean_ft * boolean_ft
    mult_numeric_boolean = numeric_ft * boolean_ft
    mult_numeric_boolean2 = boolean_ft * numeric_ft

    assert issubclass(type(mult_numeric.primitive), MultiplyNumeric)
    assert issubclass(type(mult_boolean.primitive), MultiplyBoolean)
    assert issubclass(type(mult_numeric_boolean.primitive), MultiplyNumericBoolean)
    assert issubclass(type(mult_numeric_boolean2.primitive), MultiplyNumericBoolean)

    # Test with nullable types
    es["customers"].ww.set_types(
        logical_types={"age": "IntegerNullable", "loves_ice_cream": "BooleanNullable"},
    )
    numeric_ft = Feature(es["customers"].ww["age"])
    boolean_ft = Feature(es["customers"].ww["loves_ice_cream"])
    mult_numeric = numeric_ft * numeric_ft
    mult_boolean = boolean_ft * boolean_ft
    mult_numeric_boolean = numeric_ft * boolean_ft
    mult_numeric_boolean2 = boolean_ft * numeric_ft

    assert issubclass(type(mult_numeric.primitive), MultiplyNumeric)
    assert issubclass(type(mult_boolean.primitive), MultiplyBoolean)
    assert issubclass(type(mult_numeric_boolean.primitive), MultiplyNumericBoolean)
    assert issubclass(type(mult_numeric_boolean2.primitive), MultiplyNumericBoolean)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/__init__.py
================================================


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_cumulative_time_since.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd

from featuretools.primitives import (
    CumulativeTimeSinceLastFalse,
    CumulativeTimeSinceLastTrue,
)
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestCumulativeTimeSinceLastTrue(PrimitiveTestBase):
    primitive = CumulativeTimeSinceLastTrue
    booleans = pd.Series([False, True, False, True, False, False])
    datetimes = pd.Series(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))],
    )
    answer = pd.Series([np.nan, 0, 6, 0, 6, 12])

    def test_regular(self):
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(self.datetimes, self.booleans)
        assert given_answer.equals(self.answer)

    def test_all_false(self):
        primitive_func = self.primitive().get_function()
        booleans = pd.Series([False, False, False])
        datetimes = pd.Series(
            [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))],
        )
        given_answer = primitive_func(datetimes, booleans)
        answer = pd.Series([np.nan] * 3)
        assert given_answer.equals(answer)

    def test_all_nan(self):
        primitive_func = self.primitive().get_function()
        datetimes = pd.Series([np.nan] * 4)
        booleans = pd.Series([np.nan] * 4)
        given_answer = primitive_func(datetimes, booleans)
        answer = pd.Series([np.nan] * 4)
        assert given_answer.equals(answer)

    def test_some_nans(self):
        primitive_func = self.primitive().get_function()
        booleans = pd.Series(
            [
                False,
                True,
                False,
                True,
                False,
                False,
                True,
                True,
                False,
                False,
            ],
        )
        datetimes = pd.Series([np.nan] * 2)
        datetimes = pd.concat([datetimes, self.datetimes])
        datetimes = pd.concat([datetimes, pd.Series([np.nan] * 2)])
        datetimes = datetimes.reset_index(drop=True)
        answer = pd.Series(
            [
                np.nan,
                np.nan,
                np.nan,
                0,
                6,
                12,
                0,
                0,
                np.nan,
                np.nan,
            ],
        )
        given_answer = primitive_func(datetimes, booleans)
        assert given_answer.equals(answer)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)
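

# Illustrative sketch, not the library's implementation: a pure-Python reference
# for the values asserted in TestCumulativeTimeSinceLastTrue above. It assumes the
# primitive reports the seconds elapsed since the most recent True and NaN before
# the first True, which is what the expected answers imply. The helper name
# `_reference_seconds_since_last_true` is hypothetical, and missing timestamps
# are deliberately not handled here.
import math  # noqa: E402
from datetime import datetime  # noqa: E402


def _reference_seconds_since_last_true(datetimes, booleans):
    last_true = None  # timestamp of the most recent True seen so far
    result = []
    for timestamp, flag in zip(datetimes, booleans):
        if flag:
            last_true = timestamp
        if last_true is None:
            # No True has occurred yet, so the elapsed time is undefined.
            result.append(math.nan)
        else:
            result.append((timestamp - last_true).total_seconds())
    return result


def test_reference_sketch_matches_regular_case():
    # Same inputs as TestCumulativeTimeSinceLastTrue.test_regular.
    times = [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(6)]
    flags = [False, True, False, True, False, False]
    result = _reference_seconds_since_last_true(times, flags)
    assert math.isnan(result[0])
    assert result[1:] == [0.0, 6.0, 0.0, 6.0, 12.0]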


class TestCumulativeTimeSinceLastFalse(PrimitiveTestBase):
    primitive = CumulativeTimeSinceLastFalse
    booleans = pd.Series([True, False, True, False, True, True])
    datetimes = pd.Series(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))],
    )
    answer = pd.Series([np.nan, 0, 6, 0, 6, 12])

    def test_regular(self):
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(self.datetimes, self.booleans)
        assert given_answer.equals(self.answer)

    def test_all_true(self):
        primitive_func = self.primitive().get_function()
        booleans = pd.Series([True, True, True])
        datetimes = pd.Series(
            [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))],
        )
        given_answer = primitive_func(datetimes, booleans)
        answer = pd.Series([np.nan] * 3)
        assert given_answer.equals(answer)

    def test_all_nan(self):
        primitive_func = self.primitive().get_function()
        datetimes = pd.Series([np.nan] * 4)
        booleans = pd.Series([np.nan] * 4)
        given_answer = primitive_func(datetimes, booleans)
        answer = pd.Series([np.nan] * 4)
        assert given_answer.equals(answer)

    def test_some_nans(self):
        primitive_func = self.primitive().get_function()
        booleans = pd.Series(
            [
                True,
                False,
                True,
                False,
                True,
                True,
                False,
                False,
                True,
                True,
            ],
        )
        datetimes = pd.Series([np.nan] * 2)
        datetimes = pd.concat([datetimes, self.datetimes])
        datetimes = pd.concat([datetimes, pd.Series([np.nan] * 2)])
        datetimes = datetimes.reset_index(drop=True)
        answer = pd.Series(
            [
                np.nan,
                np.nan,
                np.nan,
                0,
                6,
                12,
                0,
                0,
                np.nan,
                np.nan,
            ],
        )
        given_answer = primitive_func(datetimes, booleans)
        assert given_answer.equals(answer)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_datetoholiday_primitive.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import DateToHoliday


def test_datetoholiday():
    date_to_holiday = DateToHoliday()

    dates = pd.Series(
        [
            datetime(2016, 1, 1),
            datetime(2016, 2, 27),
            datetime(2017, 5, 29, 10, 30, 5),
            datetime(2018, 7, 4),
        ],
    )

    holiday_series = date_to_holiday(dates).tolist()

    assert holiday_series[0] == "New Year's Day"
    assert np.isnan(holiday_series[1])
    assert holiday_series[2] == "Memorial Day"
    assert holiday_series[3] == "Independence Day"


def test_datetoholiday_error():
    error_text = r"must be one of the available countries.*"
    with pytest.raises(ValueError, match=error_text):
        DateToHoliday(country="UNK")


def test_nat():
    date_to_holiday = DateToHoliday()
    case = pd.Series(
        [
            "2019-10-14",
            "NaT",
            "2016-02-15",
            "NaT",
        ],
    ).astype("datetime64[ns]")
    answer = ["Columbus Day", np.nan, "Washington's Birthday", np.nan]
    given_answer = date_to_holiday(case).astype("str")
    np.testing.assert_array_equal(given_answer, answer)


def test_valid_country():
    date_to_holiday = DateToHoliday(country="Canada")
    case = pd.Series(
        [
            "2016-07-01",
            "2016-11-11",
            "2018-12-25",
        ],
    ).astype("datetime64[ns]")
    answer = ["Canada Day", np.nan, "Christmas Day"]
    given_answer = date_to_holiday(case).astype("str")
    np.testing.assert_array_equal(given_answer, answer)


def test_multiple_countries():
    dth_mexico = DateToHoliday(country="Mexico")

    case = pd.Series([datetime(2000, 9, 16), datetime(2005, 1, 1)])
    assert len(dth_mexico(case)) > 1

    dth_india = DateToHoliday(country="IND")
    case = pd.Series([datetime(2048, 1, 1), datetime(2048, 10, 2)])
    assert len(dth_india(case)) > 1

    dth_uk = DateToHoliday(country="UK")
    case = pd.Series([datetime(2048, 3, 17), datetime(2048, 4, 6)])
    assert len(dth_uk(case)) > 1

    countries = [
        "Argentina",
        "AU",
        "Austria",
        "BY",
        "Belgium",
        "Brazil",
        "Canada",
        "Colombia",
        "Croatia",
        "England",
        "Finland",
        "FRA",
        "Germany",
        "Italy",
        "NewZealand",
        "PortugalExt",
        "PTE",
        "Spain",
        "ES",
        "Switzerland",
        "UnitedStates",
        "US",
        "UK",
        "UA",
        "CH",
        "SE",
        "ZA",
    ]
    for x in countries:
        DateToHoliday(country=x)


def test_with_timezone_aware_datetimes():
    df = pd.DataFrame(
        {
            "non_timezone_aware_with_time": pd.date_range(
                "2018-07-03 09:00",
                periods=3,
            ),
            "non_timezone_aware_no_time": pd.date_range("2018-07-03", periods=3),
            "timezone_aware_with_time": pd.date_range(
                "2018-07-03 09:00",
                periods=3,
            ).tz_localize(tz="US/Eastern"),
            "timezone_aware_no_time": pd.date_range(
                "2018-07-03",
                periods=3,
            ).tz_localize(tz="US/Eastern"),
        },
    )

    date_to_holiday = DateToHoliday(country="US")
    expected = [np.nan, "Independence Day", np.nan]
    for col in df.columns:
        actual = date_to_holiday(df[col]).astype("str")
        np.testing.assert_array_equal(actual, expected)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_distancetoholiday_primitive.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import DistanceToHoliday


def test_distanceholiday():
    distance_to_holiday = DistanceToHoliday("New Year's Day")
    dates = pd.Series(
        [
            datetime(2010, 1, 1),
            datetime(2012, 5, 31),
            datetime(2017, 7, 31),
            datetime(2020, 12, 31),
        ],
    )

    expected = [0, -151, 154, 1]
    output = distance_to_holiday(dates).tolist()
    np.testing.assert_array_equal(output, expected)


def test_unknown_country_error():
    error_text = r"must be one of the available countries.*"
    with pytest.raises(ValueError, match=error_text):
        DistanceToHoliday("Victoria Day", country="UNK")


def test_unknown_holiday_error():
    error_text = r"must be one of the available holidays.*"
    with pytest.raises(ValueError, match=error_text):
        DistanceToHoliday("Alteryx Day")


def test_nat():
    distance_to_holiday = DistanceToHoliday("New Year's Day")
    case = pd.Series(
        [
            "2010-01-01",
            "NaT",
            "2012-05-31",
            "NaT",
        ],
    ).astype("datetime64[ns]")
    answer = [0, np.nan, -151, np.nan]
    given_answer = distance_to_holiday(case).astype("float")
    np.testing.assert_array_equal(given_answer, answer)


def test_valid_country():
    distance_to_holiday = DistanceToHoliday("Canada Day", country="Canada")
    case = pd.Series(
        [
            "2010-01-01",
            "2012-05-31",
            "2017-07-31",
            "2020-12-31",
        ],
    ).astype("datetime64[ns]")
    answer = [181, 31, -30, 182]
    given_answer = distance_to_holiday(case).astype("float")
    np.testing.assert_array_equal(given_answer, answer)
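

# Illustrative sketch, not the library's implementation: the expected values in
# test_distanceholiday and test_valid_country are consistent with "signed number
# of days from the input date to the nearest occurrence of the holiday", where a
# negative value means the nearest occurrence lies in the past. The helper below
# assumes a holiday that always falls on a fixed calendar date (January 1 for
# New Year's Day, July 1 for Canada Day) and ignores observed-day rules. The
# name `_signed_days_to_nearest_fixed_date` is hypothetical.
from datetime import date  # noqa: E402


def _signed_days_to_nearest_fixed_date(when, month, day_of_month):
    # Consider the holiday in the previous, current, and next calendar year.
    candidates = [
        date(year, month, day_of_month)
        for year in (when.year - 1, when.year, when.year + 1)
    ]
    nearest = min(candidates, key=lambda holiday: abs((holiday - when).days))
    return (nearest - when).days


def test_fixed_date_sketch_matches_expected_values():
    dates = [date(2010, 1, 1), date(2012, 5, 31), date(2017, 7, 31), date(2020, 12, 31)]
    new_year = [_signed_days_to_nearest_fixed_date(d, 1, 1) for d in dates]
    canada_day = [_signed_days_to_nearest_fixed_date(d, 7, 1) for d in dates]
    assert new_year == [0, -151, 154, 1]
    assert canada_day == [181, 31, -30, 182]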


def test_with_timezone_aware_datetimes():
    df = pd.DataFrame(
        {
            "non_timezone_aware_with_time": pd.date_range(
                "2018-07-03 09:00",
                periods=3,
            ),
            "non_timezone_aware_no_time": pd.date_range("2018-07-03", periods=3),
            "timezone_aware_with_time": pd.date_range(
                "2018-07-03 09:00",
                periods=3,
            ).tz_localize(tz="US/Eastern"),
            "timezone_aware_no_time": pd.date_range(
                "2018-07-03",
                periods=3,
            ).tz_localize(tz="US/Eastern"),
        },
    )

    distance_to_holiday = DistanceToHoliday("Independence Day", country="US")
    expected = [1, 0, -1]
    for col in df.columns:
        actual = distance_to_holiday(df[col])
        np.testing.assert_array_equal(actual, expected)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_expanding_primitives.py
================================================
import numpy as np
import pandas as pd
import pytest

from featuretools.primitives.standard.transform.time_series.expanding import (
    ExpandingCount,
    ExpandingMax,
    ExpandingMean,
    ExpandingMin,
    ExpandingSTD,
    ExpandingTrend,
)
from featuretools.primitives.standard.transform.time_series.utils import (
    _apply_gap_for_expanding_primitives,
)
from featuretools.utils import calculate_trend


@pytest.mark.parametrize(
    "min_periods, gap",
    [
        (5, 2),
        (5, 0),
        (0, 0),
    ],
)
def test_expanding_count_series(window_series, min_periods, gap):
    test = window_series.shift(gap)
    expected = test.expanding(min_periods=min_periods).count()
    num_nans = gap + min_periods - 1
    expected[range(num_nans)] = np.nan
    primitive_instance = ExpandingCount(min_periods=min_periods, gap=gap).get_function()
    actual = primitive_instance(window_series.index)
    pd.testing.assert_series_equal(pd.Series(actual), expected)


@pytest.mark.parametrize(
    "min_periods, gap",
    [
        (5, 2),
        (5, 0),
        (0, 0),
        (0, 1),
    ],
)
def test_expanding_count_date_range(window_date_range, min_periods, gap):
    test = _apply_gap_for_expanding_primitives(gap=gap, x=window_date_range)
    expected = test.expanding(min_periods=min_periods).count()
    num_nans = gap + min_periods - 1
    expected[range(num_nans)] = np.nan
    primitive_instance = ExpandingCount(min_periods=min_periods, gap=gap).get_function()
    actual = primitive_instance(window_date_range)
    pd.testing.assert_series_equal(pd.Series(actual), expected)


@pytest.mark.parametrize(
    "min_periods, gap",
    [
        (5, 2),
        (5, 0),
        (0, 0),
        (0, 1),
    ],
)
def test_expanding_min(window_series, min_periods, gap):
    test = window_series.shift(gap)
    expected = test.expanding(min_periods=min_periods).min().values
    primitive_instance = ExpandingMin(min_periods=min_periods, gap=gap).get_function()
    actual = primitive_instance(
        numeric=window_series,
        datetime=window_series.index,
    )
    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))


@pytest.mark.parametrize(
    "min_periods, gap",
    [
        (5, 2),
        (5, 0),
        (0, 0),
        (0, 1),
    ],
)
def test_expanding_max(window_series, min_periods, gap):
    test = window_series.shift(gap)
    expected = test.expanding(min_periods=min_periods).max().values
    primitive_instance = ExpandingMax(min_periods=min_periods, gap=gap).get_function()
    actual = primitive_instance(
        numeric=window_series,
        datetime=window_series.index,
    )
    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))


@pytest.mark.parametrize(
    "min_periods, gap",
    [
        (5, 2),
        (5, 0),
        (0, 0),
        (0, 1),
    ],
)
def test_expanding_std(window_series, min_periods, gap):
    test = window_series.shift(gap)
    expected = test.expanding(min_periods=min_periods).std().values
    primitive_instance = ExpandingSTD(min_periods=min_periods, gap=gap).get_function()
    actual = primitive_instance(
        numeric=window_series,
        datetime=window_series.index,
    )
    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))


@pytest.mark.parametrize(
    "min_periods, gap",
    [
        (5, 2),
        (5, 0),
        (0, 0),
        (0, 1),
    ],
)
def test_expanding_mean(window_series, min_periods, gap):
    test = window_series.shift(gap)
    expected = test.expanding(min_periods=min_periods).mean().values
    primitive_instance = ExpandingMean(min_periods=min_periods, gap=gap).get_function()
    actual = primitive_instance(
        numeric=window_series,
        datetime=window_series.index,
    )
    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))
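

# Illustrative sketch, not the library's implementation: the tests above build
# their expected values as `series.shift(gap).expanding(min_periods).<agg>()`.
# The pure-Python helper below spells out that recipe for the mean: shift the
# values forward by `gap` positions, then average everything seen so far,
# emitting NaN until at least `min_periods` observations are available. The name
# `_reference_expanding_mean` is hypothetical, and the inputs are assumed to
# contain no missing values of their own.
import math  # noqa: E402


def _reference_expanding_mean(values, gap=0, min_periods=1):
    # Shifting by `gap` pads the front with missing values and drops the tail.
    shifted = ([None] * gap + list(values))[: len(values)]
    total, count, result = 0.0, 0, []
    for value in shifted:
        if value is not None:
            total += value
            count += 1
        # A mean needs at least one observation even when min_periods is 0.
        enough = count >= max(min_periods, 1)
        result.append(total / count if enough else math.nan)
    return result


def test_reference_expanding_mean_sketch():
    result = _reference_expanding_mean([1, 2, 3, 4], gap=1, min_periods=2)
    assert math.isnan(result[0]) and math.isnan(result[1])
    assert result[2:] == [1.5, 2.0]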


@pytest.mark.parametrize(
    "min_periods, gap",
    [
        (5, 2),
        (5, 0),
        (0, 0),
        (0, 1),
    ],
)
def test_expanding_trend(window_series, min_periods, gap):
    test = window_series.shift(gap)
    expected = test.expanding(min_periods=min_periods).aggregate(calculate_trend).values
    primitive_instance = ExpandingTrend(min_periods=min_periods, gap=gap).get_function()
    actual = primitive_instance(
        numeric=window_series,
        datetime=window_series.index,
    )
    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))


@pytest.mark.parametrize(
    "primitive",
    [
        ExpandingMax,
        ExpandingMean,
        ExpandingMin,
        ExpandingSTD,
        ExpandingTrend,
    ],
)
def test_expanding_primitives_throw_error_when_given_string_offset(
    window_series,
    primitive,
):
    error_msg = (
        "String offsets are not supported for the gap parameter in Expanding primitives"
    )
    with pytest.raises(TypeError, match=error_msg):
        primitive(gap="2H").get_function()(
            numeric=window_series,
            datetime=window_series.index,
        )


def test_apply_gap_for_expanding_primitives_throws_error_when_given_string_offset(
    window_series,
):
    error_msg = (
        "String offsets are not supported for the gap parameter in Expanding primitives"
    )
    with pytest.raises(TypeError, match=error_msg):
        _apply_gap_for_expanding_primitives(window_series, gap="2H")


@pytest.mark.parametrize(
    "gap",
    [
        2,
        5,
        3,
        0,
    ],
)
def test_apply_gap_for_expanding_primitives(window_series, gap):
    actual = _apply_gap_for_expanding_primitives(window_series, gap).values
    expected = window_series.shift(gap).values
    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))


@pytest.mark.parametrize(
    "gap",
    [
        2,
        5,
        3,
        0,
    ],
)
def test_apply_gap_for_expanding_primitives_handles_date_range(
    window_date_range,
    gap,
):
    actual = pd.Series(
        _apply_gap_for_expanding_primitives(window_date_range, gap).values,
    )
    expected = pd.Series(window_date_range.to_series().shift(gap).values)
    pd.testing.assert_series_equal(actual, expected)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_exponential_primitives.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import (
    ExponentialWeightedAverage,
    ExponentialWeightedSTD,
    ExponentialWeightedVariance,
)


def test_regular_com_avg():
    primitive_instance = ExponentialWeightedAverage(com=0.5)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series([1.0, 1.75, 5.384615384615384, 5.125])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_span_avg():
    primitive_instance = ExponentialWeightedAverage(span=1.5)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series([1.0, 1.8333333333333335, 6.0, 5.198717948717948])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_halflife_avg():
    primitive_instance = ExponentialWeightedAverage(halflife=2.7)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [1.0, 1.563830114594977, 3.8556233149044865, 4.2592901785684205],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_alpha_avg():
    primitive_instance = ExponentialWeightedAverage(alpha=0.8)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series([1.0, 1.8333333333333335, 6.0, 5.198717948717948])
    pd.testing.assert_series_equal(answer, correct_answer)
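

# Illustrative sketch, not the library's implementation: a pure-Python reference
# showing where the expected averages above come from. It assumes the
# "adjusted" exponentially weighted mean that pandas uses by default, where the
# value at position t is a weighted average of all values up to t with weights
# (1 - alpha) ** k for the value k steps back. The decay parameters map to
# alpha as alpha = 1 / (1 + com), alpha = 2 / (span + 1), and
# alpha = 1 - exp(-ln(2) / halflife), which is why span=1.5 and alpha=0.8 share
# the same expected values. The name `_reference_ewm_mean` is hypothetical.
import math  # noqa: E402


def _reference_ewm_mean(values, alpha):
    result = []
    for end in range(len(values)):
        # The newest value gets weight 1; older values decay geometrically.
        weights = [(1 - alpha) ** (end - i) for i in range(end + 1)]
        weighted_sum = sum(w * v for w, v in zip(weights, values))
        result.append(weighted_sum / sum(weights))
    return result


def test_reference_ewm_mean_matches_expected_values():
    array = [1, 2, 7, 5]
    expected_by_alpha = {
        1 / (1 + 0.5): [1.0, 1.75, 5.384615384615384, 5.125],
        2 / (1.5 + 1): [1.0, 1.8333333333333335, 6.0, 5.198717948717948],
        1 - math.exp(-math.log(2) / 2.7): [
            1.0,
            1.563830114594977,
            3.8556233149044865,
            4.2592901785684205,
        ],
    }
    for alpha, expected in expected_by_alpha.items():
        actual = _reference_ewm_mean(array, alpha)
        assert all(math.isclose(a, e, rel_tol=1e-9) for a, e in zip(actual, expected))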


def test_na_avg():
    primitive_instance = ExponentialWeightedAverage(com=0.5)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, np.nan, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [1.0, 1.75, 5.384615384615384, 5.384615384615384, 5.053191489361702],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_ignorena_true_avg():
    primitive_instance = ExponentialWeightedAverage(com=0.5, ignore_na=True)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, np.nan, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [1.0, 1.75, 5.384615384615384, 5.384615384615384, 5.125],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_com_std():
    primitive_instance = ExponentialWeightedSTD(com=0.5)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [np.nan, 0.7071067811865475, 3.584153156068229, 2.0048019276803304],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_span_std():
    primitive_instance = ExponentialWeightedSTD(span=1.5)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [np.nan, 0.7071067811865476, 3.6055512754639887, 1.7311551816712718],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_halflife_std():
    primitive_instance = ExponentialWeightedSTD(halflife=2.7)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [np.nan, 0.7071067811865475, 3.3565236098585416, 2.631776826295855],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_alpha_std():
    primitive_instance = ExponentialWeightedSTD(alpha=0.8)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [np.nan, 0.7071067811865476, 3.6055512754639887, 1.7311551816712718],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_na_std():
    primitive_instance = ExponentialWeightedSTD(com=0.5)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, np.nan, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [
            np.nan,
            0.7071067811865475,
            3.584153156068229,
            3.5841531560682287,
            1.8408520483016189,
        ],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_ignorena_true_std():
    primitive_instance = ExponentialWeightedSTD(com=0.5, ignore_na=True)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, np.nan, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [
            np.nan,
            0.7071067811865475,
            3.584153156068229,
            3.584153156068229,
            2.0048019276803304,
        ],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_com_var():
    primitive_instance = ExponentialWeightedVariance(com=0.5)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [np.nan, 0.49999999999999983, 12.846153846153847, 4.019230769230769],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_span_var():
    primitive_instance = ExponentialWeightedVariance(span=1.5)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series([np.nan, 0.5, 12.999999999999996, 2.996898263027294])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_halflife_var():
    primitive_instance = ExponentialWeightedVariance(halflife=2.7)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [np.nan, 0.49999999999999994, 11.266250743537816, 6.926249263427883],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_regular_alpha_var():
    primitive_instance = ExponentialWeightedVariance(alpha=0.8)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series([np.nan, 0.5, 12.999999999999996, 2.996898263027294])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_na_var():
    primitive_instance = ExponentialWeightedVariance(com=0.5)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, np.nan, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [
            np.nan,
            0.49999999999999983,
            12.846153846153847,
            12.846153846153843,
            3.3887362637362655,
        ],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


def test_ignorena_true_var():
    primitive_instance = ExponentialWeightedVariance(com=0.5, ignore_na=True)
    primitive_func = primitive_instance.get_function()
    array = pd.Series([1, 2, 7, np.nan, 5])
    answer = pd.Series(primitive_func(array))
    correct_answer = pd.Series(
        [
            np.nan,
            0.49999999999999983,
            12.846153846153847,
            12.846153846153847,
            4.019230769230769,
        ],
    )
    pd.testing.assert_series_equal(answer, correct_answer)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_full_name_primitives.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import (
    FullNameToFirstName,
    FullNameToLastName,
    FullNameToTitle,
)
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestFullNameToFirstName(PrimitiveTestBase):
    primitive = FullNameToFirstName

    def test_names_with_titles(self):
        # Note: the primitive misidentifies the first name for
        # "Oliva y Ocana, Dona. Fermina": it returns "Oliva", not "Fermina".
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Spector, Mr. Woolf",
                "Oliva y Ocana, Dona. Fermina",
                "Saether, Mr. Simon Sivertsen",
                "Ware, Mr. Frederick",
                "Peter, Master. Michael J",
            ],
        )
        answer = pd.Series(["Woolf", "Oliva", "Simon", "Frederick", "Michael"])
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_no_title(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Peter, Michael J",
                "James Masters",
                "Kate Elizabeth Brown-Jones",
            ],
        )
        answer = pd.Series(["Michael", "James", "Kate"], dtype=object)
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_empty_string(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Peter, Michael J",
                "",
                "Kate Elizabeth Brown-Jones",
            ],
        )
        answer = pd.Series(["Michael", np.nan, "Kate"], dtype=object)
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_single_name(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Peter, Michael J",
                "James",
                "Kate Elizabeth Brown-Jones",
            ],
        )
        answer = pd.Series(["Michael", "James", "Kate"], dtype=object)
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_nan(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(["Mr. James Brown", np.nan, None])
        answer = pd.Series(["James", np.nan, np.nan])
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestFullNameToLastName(PrimitiveTestBase):
    primitive = FullNameToLastName

    def test_names_with_titles(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Spector, Mr. Woolf",
                "Oliva y Ocana, Dona. Fermina",
                "Saether, Mr. Simon Sivertsen",
                "Ware, Mr. Frederick",
                "Peter, Master. Michael J",
            ],
        )
        answer = pd.Series(["Spector", "Oliva y Ocana", "Saether", "Ware", "Peter"])
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_no_title(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Peter, Michael J",
                "James Masters",
                "Kate Elizabeth Brown-Jones",
            ],
        )
        answer = pd.Series(["Peter", "Masters", "Brown-Jones"], dtype=object)
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_empty_string(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Peter, Michael J",
                "",
                "Kate Elizabeth Brown-Jones",
            ],
        )
        answer = pd.Series(["Peter", np.nan, "Brown-Jones"], dtype=object)
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_single_name(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Peter, Michael J",
                "James",
                "Kate Elizabeth Brown-Jones",
            ],
        )
        answer = pd.Series(["Peter", np.nan, "Brown-Jones"], dtype=object)
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_nan(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(["Mr. James Brown", np.nan, None])
        answer = pd.Series(["Brown", np.nan, np.nan])
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestFullNameToTitle(PrimitiveTestBase):
    primitive = FullNameToTitle

    def test_names_with_titles(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Spector, Mr. Woolf",
                "Oliva y Ocana, Dona. Fermina",
                "Saether, Mr. Simon Sivertsen",
                "Ware, Mr. Frederick",
                "Peter, Master. Michael J",
                "Mr. Brown",
            ],
        )
        answer = pd.Series(["Mr", "Dona", "Mr", "Mr", "Master", "Mr"])
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_no_title(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(
            [
                "Peter, Michael J",
                "James Master.",
                "Mrs Brown",
                "",
            ],
        )
        answer = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=object)
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_nan(self):
        primitive_func = self.primitive().get_function()
        names = pd.Series(["Mr. Brown", np.nan, None])
        answer = pd.Series(["Mr", np.nan, np.nan])
        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_is_federal_holiday.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
from pytest import raises

from featuretools.primitives import IsFederalHoliday


def test_regular():
    primitive_instance = IsFederalHoliday()
    primitive_func = primitive_instance.get_function()
    case = pd.Series(
        [
            "2016-01-01",
            "2016-02-29",
            "2017-05-29",
            datetime(2019, 7, 4, 10, 0, 30),
        ],
    ).astype("datetime64[ns]")
    answer = pd.Series([True, False, True, True])
    given_answer = pd.Series(primitive_func(case))
    assert given_answer.equals(answer)


def test_nat():
    primitive_instance = IsFederalHoliday()
    primitive_func = primitive_instance.get_function()
    case = pd.Series(
        [
            "2019-10-14",
            "NaT",
            "2016-02-29",
            "NaT",
        ],
    ).astype("datetime64[ns]")
    answer = pd.Series([True, np.nan, False, np.nan])
    given_answer = pd.Series(primitive_func(case))
    assert given_answer.equals(answer)


def test_valid_country():
    primitive_instance = IsFederalHoliday(country="Canada")
    primitive_func = primitive_instance.get_function()
    case = pd.Series(
        [
            "2016-07-01",
            "2016-11-11",
            "2018-09-03",
        ],
    ).astype("datetime64[ns]")
    answer = pd.Series([True, False, True])
    given_answer = pd.Series(primitive_func(case))
    assert given_answer.equals(answer)


def test_invalid_country():
    error_text = "must be one of the available countries"
    with raises(ValueError, match=error_text):
        IsFederalHoliday(country="")


def test_multiple_countries():
    primitive_mexico = IsFederalHoliday(country="Mexico")
    primitive_func = primitive_mexico.get_function()
    case = pd.Series([datetime(2000, 9, 16), datetime(2005, 1, 1)])
    assert len(primitive_func(case)) > 1
    primitive_india = IsFederalHoliday(country="IND")
    primitive_func = primitive_india.get_function()
    case = pd.Series([datetime(2048, 1, 1), datetime(2048, 10, 2)])
    assert len(primitive_func(case)) > 1
    primitive_uk = IsFederalHoliday(country="UK")
    primitive_func = primitive_uk.get_function()
    case = pd.Series([datetime(2048, 3, 17), datetime(2048, 4, 6)])
    assert len(primitive_func(case)) > 1
    countries = [
        "Argentina",
        "AU",
        "Austria",
        "BY",
        "Belgium",
        "Brazil",
        "Canada",
        "Colombia",
        "Croatia",
        "England",
        "Finland",
        "FRA",
        "Germany",
        "Italy",
        "NewZealand",
        "PortugalExt",
        "PTE",
        "Spain",
        "ES",
        "Switzerland",
        "UnitedStates",
        "US",
        "UK",
        "UA",
        "CH",
        "SE",
        "ZA",
    ]
    for x in countries:
        IsFederalHoliday(country=x)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_latlong_primitives.py
================================================
import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import CityblockDistance, GeoMidpoint, IsInGeoBox


def test_cityblock():
    primitive_instance = CityblockDistance()
    latlong_1 = pd.Series([(i, i) for i in range(3)])
    latlong_2 = pd.Series([(i, i) for i in range(3, 6)])
    answer = pd.Series([414.56051391, 414.52893691, 414.43421555])
    given_answer = primitive_instance(latlong_1, latlong_2)
    np.testing.assert_allclose(given_answer, answer, rtol=1e-09)

    primitive_instance = CityblockDistance(unit="kilometers")
    answer = pd.Series([667.1704814, 667.11966315, 666.96722389])
    given_answer = primitive_instance(latlong_1, latlong_2)
    np.testing.assert_allclose(given_answer, answer, rtol=1e-09)


def test_cityblock_nans():
    primitive_instance = CityblockDistance()
    lats_longs_1 = [(i, i) for i in range(2)]
    lats_longs_2 = [(i, i) for i in range(2, 4)]
    lats_longs_1 += [(1, 1), (np.nan, 3), (4, np.nan), (np.nan, np.nan)]
    lats_longs_2 += [(np.nan, np.nan), (np.nan, 5), (6, np.nan), (np.nan, np.nan)]
    answer = pd.Series([276.37367594, 276.35262728] + [np.nan] * 4)
    given_answer = primitive_instance(lats_longs_1, lats_longs_2)
    np.testing.assert_allclose(given_answer, answer, rtol=1e-09)


def test_cityblock_error():
    error_text = "Invalid unit given"
    with pytest.raises(ValueError, match=error_text):
        CityblockDistance(unit="invalid")


def test_midpoint():
    latlong1 = pd.Series([(-90, -180), (90, 180)])
    latlong2 = pd.Series([(+90, +180), (-90, -180)])
    function = GeoMidpoint().get_function()
    answer = function(latlong1, latlong2)
    for lat, longi in answer:
        assert lat == 0.0
        assert longi == 0.0


def test_midpoint_floating():
    latlong1 = pd.Series([(-45.5, -100.5), (45.5, 100.5)])
    latlong2 = pd.Series([(+45.5, +100.5), (-45.5, -100.5)])
    function = GeoMidpoint().get_function()
    answer = function(latlong1, latlong2)
    for lat, longi in answer:
        assert lat == 0.0
        assert longi == 0.0


def test_midpoint_zeros():
    latlong1 = pd.Series([(0, 0), (0, 0)])
    latlong2 = pd.Series([(0, 0), (0, 0)])
    function = GeoMidpoint().get_function()
    answer = function(latlong1, latlong2)
    for lat, longi in answer:
        assert lat == 0.0
        assert longi == 0.0


def test_midpoint_nan():
    all_nan = pd.Series([(np.nan, np.nan), (np.nan, np.nan)])
    latlong1 = pd.Series([(0, 0), (0, 0)])
    function = GeoMidpoint().get_function()
    answer = function(all_nan, latlong1)
    for lat, longi in answer:
        assert np.isnan(lat)
        assert np.isnan(longi)


def test_isingeobox():
    latlong = pd.Series(
        [
            (1, 2),
            (5, 7),
            (-5, 4),
            (2, 3),
            (0, 0),
            (np.nan, np.nan),
            (-2, np.nan),
            (np.nan, 1),
        ],
    )
    bottomleft = (-5, -5)
    topright = (5, 5)
    primitive = IsInGeoBox(bottomleft, topright)
    function = primitive.get_function()
    primitive_answer = function(latlong)
    answer = pd.Series([True, False, True, True, True, False, False, False])
    assert np.array_equal(primitive_answer, answer)


def test_boston():
    NYC = (40.7128, -74.0060)
    SF = (37.7749, -122.4194)
    Somerville = (42.3876, -71.0995)
    Beijing = (39.9042, 116.4074)
    CapeTown = (-33.9249, 18.4241)
    latlong = pd.Series([NYC, SF, Somerville, Beijing, CapeTown])
    LynnMA = (42.4668, -70.9495)
    DedhamMA = (42.2436, -71.1677)
    primitive = IsInGeoBox(LynnMA, DedhamMA)
    function = primitive.get_function()
    primitive_answer = function(latlong)
    answer = pd.Series([False, False, True, False, False])
    assert np.array_equal(primitive_answer, answer)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_percent_change.py
================================================
import numpy as np
import pandas as pd
from pytest import raises

from featuretools.primitives import PercentChange
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestPercentChange(PrimitiveTestBase):
    primitive = PercentChange

    def test_regular(self):
        data = pd.Series([2, 5, 15, 3, 3, 9, 4.5])
        answer = pd.Series([np.nan, 1.5, 2.0, -0.8, 0, 2.0, -0.5])
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_raises(self):
        with raises(ValueError):
            self.primitive(fill_method="invalid")

    def test_period(self):
        data = pd.Series([2, 4, 8])
        answer = pd.Series([np.nan, np.nan, 3])
        primitive_func = self.primitive(periods=2).get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)
        data = pd.Series([2, 4, 8] + [np.nan] * 4)
        primitive_func = self.primitive(limit=2).get_function()
        answer = pd.Series([np.nan, 1, 1, 0, 0, np.nan, np.nan])
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_nan(self):
        data = pd.Series([np.nan, 5, 10, 20, np.nan, 10, np.nan])
        answer = pd.Series([np.nan, np.nan, 1, 1, 0, -0.5, 0])
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_zero(self):
        data = pd.Series([2, 0, 0, 5, 0, -4])
        # np.NINF was removed in NumPy 2.0; -np.inf works on every version.
        answer = pd.Series([np.nan, -1, np.nan, np.inf, -1, -np.inf])
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_inf(self):
        # np.NINF was removed in NumPy 2.0; -np.inf works on every version.
        data = pd.Series([0, np.inf, 0, 5, -np.inf, np.inf, -np.inf])
        answer = pd.Series([np.nan, np.inf, -1, np.inf, -np.inf, np.nan, np.nan])
        primitive_func = self.primitive().get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_freq(self):
        dates = pd.DatetimeIndex(
            ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-05"],
        )
        data = pd.Series([1, 2, 3, 4], index=dates)
        answer = pd.Series([np.nan, 1.0, 0.5, np.nan])
        date_offset = pd.tseries.offsets.DateOffset(days=1)
        primitive_func = self.primitive(freq=date_offset).get_function()
        given_answer = primitive_func(data)
        np.testing.assert_array_equal(given_answer, answer)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_percent_unique.py
================================================
import numpy as np
import pandas as pd

from featuretools.primitives import PercentUnique
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
)


class TestPercentUnique(PrimitiveTestBase):
    array = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 7, 8])
    primitive = PercentUnique

    def test_percent_unique(self):
        primitive_func = self.primitive().get_function()
        assert primitive_func(self.array) == (8 / 10.0)

    def test_nans(self):
        primitive_func = self.primitive().get_function()
        array_nans = pd.concat([self.array.copy(), pd.Series([np.nan])])
        assert primitive_func(array_nans) == (8 / 11.0)
        primitive_func = self.primitive(skipna=False).get_function()
        assert primitive_func(array_nans) == (9 / 11.0)

    def test_multiple_nans(self):
        primitive_func = self.primitive().get_function()
        array_nans = pd.concat([self.array.copy(), pd.Series([np.nan] * 3)])
        assert primitive_func(array_nans) == (8 / 13.0)
        primitive_func = self.primitive(skipna=False).get_function()
        assert primitive_func(array_nans) == (9 / 13.0)

    def test_empty_string(self):
        primitive_func = self.primitive().get_function()
        array_empty_string = pd.concat([self.array.copy(), pd.Series([np.nan, "", ""])])
        assert primitive_func(array_empty_string) == (9 / 13.0)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_postal_primitives.py
================================================
import pandas as pd

from featuretools.primitives.standard.transform.postal import (
    OneDigitPostalCode,
    TwoDigitPostalCode,
)


def test_one_digit_postal_code(postal_code_dataframe):
    primitive = OneDigitPostalCode().get_function()
    for x in postal_code_dataframe:
        series = postal_code_dataframe[x]
        actual = primitive(series)
        expected = series.apply(lambda t: str(t)[0] if pd.notna(t) else pd.NA)
        pd.testing.assert_series_equal(actual, expected)


def test_two_digit_postal_code(postal_code_dataframe):
    primitive = TwoDigitPostalCode().get_function()
    for x in postal_code_dataframe:
        series = postal_code_dataframe[x]
        actual = primitive(series)
        expected = series.apply(lambda t: str(t)[:2] if pd.notna(t) else pd.NA)
        pd.testing.assert_series_equal(actual, expected)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_same_as_previous.py
================================================
import numpy as np
import pandas as pd
import pytest

from featuretools.primitives import SameAsPrevious


class TestSameAsPrevious:
    def test_ints(self):
        primitive_func = SameAsPrevious().get_function()
        array = pd.Series([1, 2, 2, 3, 2], dtype="int64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, False, True, False, False])
        pd.testing.assert_series_equal(answer, correct_answer)

    def test_int64(self):
        primitive_func = SameAsPrevious().get_function()
        array = pd.Series([1, 2, 2, 3, 2], dtype="Int64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, False, True, False, False], dtype="boolean")
        pd.testing.assert_series_equal(answer, correct_answer)

    def test_floats(self):
        primitive_func = SameAsPrevious().get_function()
        array = pd.Series([1.0, 2.5, 2.5, 3.0, 2.0], dtype="float64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, False, True, False, False])
        pd.testing.assert_series_equal(answer, correct_answer)

    def test_mixed(self):
        primitive_func = SameAsPrevious().get_function()
        array = pd.Series([1, 2, 2.0, 3, 2.0], dtype="float64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, False, True, False, False])
        np.testing.assert_array_equal(answer, correct_answer)

    def test_nan(self):
        primitive_instance = SameAsPrevious()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1, np.nan, 3, np.nan, 2], dtype="float64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, True, False, True, False])
        np.testing.assert_array_equal(answer, correct_answer)

    def test_all_nan(self):
        primitive_instance = SameAsPrevious()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype="float64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, False, False, False])
        np.testing.assert_array_equal(answer, correct_answer)

    def test_inf(self):
        primitive_instance = SameAsPrevious()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1, np.inf, 3, np.inf, 2], dtype="float64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, False, False, False, False])
        np.testing.assert_array_equal(answer, correct_answer)

    def test_all_inf(self):
        primitive_instance = SameAsPrevious()
        primitive_func = primitive_instance.get_function()
        array = pd.Series([np.inf, np.inf, np.inf, np.inf], dtype="float64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, True, True, True])
        np.testing.assert_array_equal(answer, correct_answer)

    def test_fill_method_bfill(self):
        primitive_instance = SameAsPrevious(fill_method="bfill")
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1, np.nan, 3, 2, 2], dtype="float64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, False, True, False, True])
        np.testing.assert_array_equal(answer, correct_answer)

    def test_fill_method_bfill_with_limit(self):
        primitive_instance = SameAsPrevious(fill_method="bfill", limit=2)
        primitive_func = primitive_instance.get_function()
        array = pd.Series([1, np.nan, np.nan, np.nan, 2, 3], dtype="float64")
        answer = primitive_func(array)
        correct_answer = pd.Series([False, False, False, True, True, False])
        np.testing.assert_array_equal(answer, correct_answer)

    def test_raises(self):
        with pytest.raises(ValueError):
            SameAsPrevious(fill_method="invalid")


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_savgol_filter.py
================================================
from math import floor

import numpy as np
import pandas as pd
from pytest import raises

from featuretools.primitives import SavgolFilter
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


class TestSavgolFilter(PrimitiveTestBase):
    primitive = SavgolFilter
    data = pd.Series(
        [
            0,
            1,
            1,
            2,
            3,
            4,
            5,
            7,
            8,
            7,
            9,
            9,
            12,
            11,
            12,
            14,
            15,
            17,
            17,
            17,
            20,
            21,
            20,
            20,
            22,
            21,
            25,
            25,
            26,
            29,
            30,
            30,
            28,
            26,
            34,
            35,
            33,
            31,
            38,
            34,
            39,
            37,
            42,
            35,
            36,
            44,
            46,
            43,
            39,
            39,
            44,
            49,
            45,
            44,
            44,
            52,
            50,
            47,
            58,
            59,
            60,
            55,
            57,
            63,
            61,
            65,
            66,
            57,
            65,
            61,
            60,
            71,
            64,
            62,
            70,
            65,
            67,
            77,
            68,
            75,
            72,
            69,
            82,
            66,
            84,
            80,
            76,
            87,
            77,
            73,
            90,
            91,
            92,
            93,
            78,
            76,
            82,
            96,
            91,
            94,
        ],
    )
    expected_output = pd.Series(
        [
            -0.24600037643516087,
            0.6354225484660259,
            1.518717742974036,
            2.405318302343475,
            3.296657321828948,
            4.1941678966850615,
            5.099283122166421,
            6.0134360935276305,
            6.938059906023296,
            7.874587654908025,
            8.824452435436303,
            9.786858450473883,
            10.923177508989724,
            12.025171624713803,
            13.009153318077633,
            14.08041843739766,
            14.900621118012227,
            15.796338672768673,
            16.77084014383764,
            17.662961752206375,
            18.472703497874882,
            19.451454723765682,
            20.530565544295253,
            21.849950964367157,
            22.478260869564927,
            23.15233736515171,
            24.12356979405003,
            25.23962079110788,
            26.000980712650854,
            27.082379862699877,
            27.787839163124843,
            28.879045439685797,
            29.762994442627924,
            31.067342268714864,
            32.11147433801854,
            32.666557698593884,
            33.06864988558309,
            34.00098071265075,
            35.134030728995945,
            36.15135665250035,
            36.945733899966825,
            37.56227525335028,
            38.55769859431137,
            39.3975155279498,
            39.87054593004198,
            40.304347826086435,
            41.11670480549146,
            42.00948022229432,
            41.982674076495044,
            42.62798300098016,
            43.15887544949274,
            44.53481529911678,
            45.680614579927486,
            46.93886891140834,
            47.98300098071202,
            48.80549199084604,
            50.28244524354299,
            52.66851912389601,
            54.28604118993064,
            55.81529911735788,
            57.10297482837455,
            57.82641386073805,
            59.45276234063342,
            60.77280156913945,
            61.23667865315383,
            61.81660673422607,
            62.60281137626594,
            62.54004576658957,
            62.78653154625613,
            63.23046747302958,
            64.09087937234307,
            65.25661981039471,
            65.19385420071833,
            66.34161490683144,
            66.65021248774022,
            67.38280483818154,
            68.8126838836212,
            69.79470415168265,
            70.943772474664,
            72.74076495586698,
            73.04020921869797,
            73.3586139261187,
            74.67734553775647,
            75.71559333115299,
            77.51814318404607,
            79.62471395880902,
            80.60150375939745,
            80.61163779012645,
            81.89342922523593,
            82.41124550506593,
            83.19293292519846,
            83.97174920172642,
            84.7620599588564,
            85.57823082079385,
            86.4346274117442,
            87.34561535591293,
            88.32556027750543,
            89.38882780072717,
            90.54978354978357,
            91.82279314888011,
        ],
    )

    def test_error(self):
        window_length = 1
        polyorder = 3
        mode = "incorrect"
        error_text = "polyorder must be less than window_length."
        with raises(ValueError, match=error_text):
            self.primitive(window_length, polyorder)

        error_text = (
            "Both window_length and polyorder must be defined if you define one."
        )

        with raises(ValueError, match=error_text):
            self.primitive(window_length=window_length)
        with raises(ValueError, match=error_text):
            self.primitive(polyorder=polyorder)
        error_text = "mode must be 'mirror', 'constant', 'nearest', 'wrap' or 'interp'."
        with raises(ValueError, match=error_text):
            self.primitive(
                window_length=window_length,
                polyorder=polyorder,
                mode=mode,
            )

    def test_less_window_size(self):
        primitive_func = self.primitive().get_function()
        for i in range(20):
            data = pd.Series(list(range(i)), dtype="float64")
            assert data.equals(primitive_func(data))

    def test_regular(self):
        window_length = floor(len(self.data) / 10) * 2 + 1
        polyorder = 3
        primitive_func = self.primitive(window_length, polyorder).get_function()
        output = list(primitive_func(self.data))
        for a, b in zip(self.expected_output, output):
            assert np.isclose(a, b)

    def test_nans(self):
        primitive_func = self.primitive().get_function()
        data_nans = self.data.copy()
        data_nans = pd.concat([data_nans, pd.Series([np.nan] * 5, dtype="float64")])
        # more than 5 nans due to window
        assert sum(np.isnan(primitive_func(data_nans))) == 15

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instantiate = self.primitive()
        transform.append(primitive_instantiate)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_season.py
================================================
from datetime import datetime

import pandas as pd

from featuretools.primitives import Season


class TestSeason:
    def test_regular(self):
        primitive_instance = Season()
        primitive_func = primitive_instance.get_function()
        case = pd.date_range(start="2019-01", periods=12, freq="m").to_series()
        answer = pd.Series(
            [
                "winter",
                "winter",
                "spring",
                "spring",
                "spring",
                "summer",
                "summer",
                "summer",
                "fall",
                "fall",
                "fall",
                "winter",
            ],
            dtype="string",
        )
        given_answer = primitive_func(case)
        pd.testing.assert_series_equal(
            given_answer.reset_index(drop=True),
            answer.reset_index(drop=True),
        )

    def test_nat(self):
        primitive_instance = Season()
        primitive_func = primitive_instance.get_function()
        case = pd.Series(
            [
                "NaT",
                "2019-02",
                "2019-03",
                "NaT",
            ],
        ).astype("datetime64[ns]")
        answer = pd.Series([pd.NA, "winter", "winter", pd.NA], dtype="string")
        given_answer = pd.Series(primitive_func(case))
        pd.testing.assert_series_equal(given_answer, answer)

    def test_datetime(self):
        primitive_instance = Season()
        primitive_func = primitive_instance.get_function()
        case = pd.Series(
            [
                datetime(2011, 3, 1),
                datetime(2011, 6, 1),
                datetime(2011, 9, 1),
                datetime(2011, 12, 1),
                # leap year
                datetime(2020, 2, 29),
            ],
        )
        answer = pd.Series(
            ["winter", "spring", "summer", "fall", "winter"],
            dtype="string",
        )
        given_answer = primitive_func(case)
        pd.testing.assert_series_equal(given_answer, answer)


================================================
FILE: featuretools/tests/primitive_tests/transform_primitive_tests/test_transform_primitive.py
================================================
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import pytest
from pytz import timezone

from featuretools.primitives import (
    Age,
    DateToTimeZone,
    DayOfYear,
    DaysInMonth,
    EmailAddressToDomain,
    FileExtension,
    IsFirstWeekOfMonth,
    IsFreeEmailDomain,
    IsLeapYear,
    IsLunchTime,
    IsMonthEnd,
    IsMonthStart,
    IsQuarterEnd,
    IsQuarterStart,
    IsWorkingHours,
    IsYearEnd,
    IsYearStart,
    Lag,
    NthWeekOfMonth,
    NumericLag,
    PartOfDay,
    Quarter,
    RateOfChange,
    TimeSince,
    URLToDomain,
    URLToProtocol,
    URLToTLD,
    Week,
    get_transform_primitives,
)
from featuretools.tests.primitive_tests.utils import (
    PrimitiveTestBase,
    find_applicable_primitives,
    valid_dfs,
)


def test_time_since():
    time_since = TimeSince()
    # datetime positional arguments: year, month, day, hour, minute, second, microsecond
    times = pd.Series(
        [
            datetime(2019, 3, 1, 0, 0, 0, 1),
            datetime(2019, 3, 1, 0, 0, 1, 0),
            datetime(2019, 3, 1, 0, 2, 0, 0),
        ],
    )
    cutoff_time = datetime(2019, 3, 1, 0, 0, 0, 0)
    values = time_since(array=times, time=cutoff_time)

    assert list(map(int, values)) == [0, -1, -120]

    time_since = TimeSince(unit="nanoseconds")
    values = time_since(array=times, time=cutoff_time)
    assert list(map(round, values)) == [-1000, -1000000000, -120000000000]

    time_since = TimeSince(unit="milliseconds")
    values = time_since(array=times, time=cutoff_time)
    assert list(map(int, values)) == [0, -1000, -120000]

    time_since = TimeSince(unit="Milliseconds")
    values = time_since(array=times, time=cutoff_time)
    assert list(map(int, values)) == [0, -1000, -120000]

    time_since = TimeSince(unit="Years")
    values = time_since(array=times, time=cutoff_time)
    assert list(map(int, values)) == [0, 0, 0]

    times_y = pd.Series(
        [
            datetime(2019, 3, 1, 0, 0, 0, 1),
            datetime(2020, 3, 1, 0, 0, 1, 0),
            datetime(2017, 3, 1, 0, 0, 0, 0),
        ],
    )

    time_since = TimeSince(unit="Years")
    values = time_since(array=times_y, time=cutoff_time)
    assert list(map(int, values)) == [0, -1, 1]

    error_text = "Invalid unit given, make sure it is plural"
    with pytest.raises(ValueError, match=error_text):
        time_since = TimeSince(unit="na")
        time_since(array=times, time=cutoff_time)


def test_age():
    age = Age()
    dates = pd.Series(datetime(2010, 2, 26))
    ages = age(dates, time=datetime(2020, 2, 26))
    correct_ages = [10.005]  # .005 added due to leap years
    np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3)


def test_age_two_years_quarterly():
    age = Age()
    dates = pd.Series(pd.date_range("2010-01-01", "2011-12-31", freq="Q"))
    ages = age(dates, time=datetime(2020, 2, 26))
    correct_ages = [9.915, 9.666, 9.414, 9.162, 8.915, 8.666, 8.414, 8.162]
    np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3)


def test_age_leap_year():
    age = Age()
    dates = pd.Series([datetime(2016, 1, 1)])
    ages = age(dates, time=datetime(2016, 3, 1))
    correct_ages = [(31 + 29) / 365.0]
    np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3)
    # born leap year date
    dates = pd.Series([datetime(2016, 2, 29)])
    ages = age(dates, time=datetime(2020, 2, 29))
    correct_ages = [4.0027]  # .0027 added due to leap year
    np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3)


def test_age_nan():
    age = Age()
    dates = pd.Series([datetime(2010, 1, 1), np.nan, datetime(2012, 1, 1)])
    ages = age(dates, time=datetime(2020, 2, 26))
    correct_ages = [10.159, np.nan, 8.159]
    np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3)


def test_day_of_year():
    doy = DayOfYear()
    dates = pd.Series([datetime(2019, 12, 31), np.nan, datetime(2020, 12, 31)])
    days_of_year = doy(dates)
    correct_days = [365, np.nan, 366]
    np.testing.assert_array_equal(days_of_year, correct_days)


def test_days_in_month():
    dim = DaysInMonth()
    dates = pd.Series(
        [datetime(2010, 1, 1), datetime(2019, 2, 1), np.nan, datetime(2020, 2, 1)],
    )
    days_in_month = dim(dates)
    correct_days = [31, 28, np.nan, 29]
    np.testing.assert_array_equal(days_in_month, correct_days)


def test_is_leap_year():
    ily = IsLeapYear()
    dates = pd.Series([datetime(2020, 1, 1), datetime(2021, 1, 1)])
    leap_year_bools = ily(dates)
    correct_bools = [True, False]
    np.testing.assert_array_equal(leap_year_bools, correct_bools)


def test_is_month_end():
    ime = IsMonthEnd()
    dates = pd.Series(
        [datetime(2019, 3, 1), datetime(2021, 2, 28), datetime(2020, 2, 29)],
    )
    ime_bools = ime(dates)
    correct_bools = [False, True, True]
    np.testing.assert_array_equal(ime_bools, correct_bools)


def test_is_month_start():
    ims = IsMonthStart()
    dates = pd.Series(
        [datetime(2019, 3, 1), datetime(2020, 2, 28), datetime(2020, 2, 29)],
    )
    ims_bools = ims(dates)
    correct_bools = [True, False, False]
    np.testing.assert_array_equal(ims_bools, correct_bools)


def test_is_quarter_end():
    iqe = IsQuarterEnd()
    dates = pd.Series([datetime(2020, 1, 1), datetime(2021, 3, 31)])
    iqe_bools = iqe(dates)
    correct_bools = [False, True]
    np.testing.assert_array_equal(iqe_bools, correct_bools)


def test_is_quarter_start():
    iqs = IsQuarterStart()
    dates = pd.Series([datetime(2020, 1, 1), datetime(2021, 3, 31)])
    iqs_bools = iqs(dates)
    correct_bools = [True, False]
    np.testing.assert_array_equal(iqs_bools, correct_bools)


def test_is_lunch_time_default():
    is_lunch_time = IsLunchTime()
    dates = pd.Series(
        [
            datetime(2022, 6, 26, 12, 12, 12),
            datetime(2022, 6, 28, 12, 3, 4),
            datetime(2022, 6, 28, 11, 3, 4),
            np.nan,
        ],
    )
    actual = is_lunch_time(dates)
    expected = [True, True, False, False]
    np.testing.assert_array_equal(actual, expected)


def test_is_lunch_time_configurable():
    is_lunch_time = IsLunchTime(14)
    dates = pd.Series(
        [
            datetime(2022, 6, 26, 12, 12, 12),
            datetime(2022, 6, 28, 14, 3, 4),
            datetime(2022, 6, 28, 11, 3, 4),
            np.nan,
        ],
    )
    actual = is_lunch_time(dates)
    expected = [False, True, False, False]
    np.testing.assert_array_equal(actual, expected)


def test_is_working_hours_standard_hours():
    is_working_hours = IsWorkingHours()
    dates = pd.Series(
        [
            datetime(2022, 6, 21, 16, 3, 3),
            datetime(2019, 1, 3, 4, 4, 4),
            datetime(2022, 1, 1, 12, 1, 2),
        ],
    )
    actual = is_working_hours(dates).tolist()
    expected = [True, False, True]
    np.testing.assert_array_equal(actual, expected)


def test_is_working_hours_configured_hours():
    is_working_hours = IsWorkingHours(15, 18)
    dates = pd.Series(
        [
            datetime(2022, 6, 21, 16, 3, 3),
            datetime(2022, 6, 26, 14, 4, 4),
            datetime(2022, 1, 1, 12, 1, 2),
        ],
    )
    answer = is_working_hours(dates).tolist()
    expected = [True, False, False]
    np.testing.assert_array_equal(answer, expected)


def test_part_of_day():
    pod = PartOfDay()
    dates = pd.Series(
        [
            datetime(2020, 1, 11, 0, 2, 1),
            datetime(2020, 1, 11, 1, 2, 1),
            datetime(2021, 3, 31, 4, 2, 1),
            datetime(2020, 3, 4, 6, 2, 1),
            datetime(2020, 3, 4, 8, 2, 1),
            datetime(2020, 3, 4, 11, 2, 1),
            datetime(2020, 3, 4, 14, 2, 3),
            datetime(2020, 3, 4, 17, 2, 3),
            datetime(2020, 2, 2, 20, 2, 2),
            np.nan,
        ],
    )
    actual = pod(dates)
    expected = pd.Series(
        [
            "midnight",
            "midnight",
            "dawn",
            "early morning",
            "late morning",
            "noon",
            "afternoon",
            "evening",
            "night",
            np.nan,
        ],
    )
    pd.testing.assert_series_equal(expected, actual)


def test_is_year_end():
    is_year_end = IsYearEnd()
    dates = pd.Series([datetime(2020, 12, 31), np.nan, datetime(2020, 1, 1)])
    answer = is_year_end(dates)
    correct_answer = [True, False, False]
    np.testing.assert_array_equal(answer, correct_answer)


def test_is_year_start():
    is_year_start = IsYearStart()
    dates = pd.Series([datetime(2020, 12, 31), np.nan, datetime(2020, 1, 1)])
    answer = is_year_start(dates)
    correct_answer = [False, False, True]
    np.testing.assert_array_equal(answer, correct_answer)


def test_quarter_regular():
    q = Quarter()
    array = pd.Series(
        [
            pd.to_datetime("2018-01-01"),
            pd.to_datetime("2018-04-01"),
            pd.to_datetime("2018-07-01"),
            pd.to_datetime("2018-10-01"),
        ],
    )
    answer = q(array)
    correct_answer = pd.Series([1, 2, 3, 4])
    np.testing.assert_array_equal(answer, correct_answer)


def test_quarter_leap_year():
    q = Quarter()
    array = pd.Series(
        [
            pd.to_datetime("2016-02-29"),
            pd.to_datetime("2018-04-01"),
            pd.to_datetime("2018-07-01"),
            pd.to_datetime("2018-10-01"),
        ],
    )
    answer = q(array)
    correct_answer = pd.Series([1, 2, 3, 4])
    np.testing.assert_array_equal(answer, correct_answer)


def test_quarter_nan_and_nat_input():
    q = Quarter()
    array = pd.Series(
        [
            pd.to_datetime("2016-02-29"),
            np.nan,
            np.datetime64("NaT"),
            pd.to_datetime("2018-10-01"),
        ],
    )
    answer = q(array)
    correct_answer = pd.Series([1, np.nan, np.nan, 4])
    np.testing.assert_array_equal(answer, correct_answer)


def test_quarter_year_before_1970():
    q = Quarter()
    array = pd.Series(
        [
            pd.to_datetime("2018-01-01"),
            pd.to_datetime("1950-04-01"),
            pd.to_datetime("1874-07-01"),
            pd.to_datetime("2018-10-01"),
        ],
    )
    answer = q(array)
    correct_answer = pd.Series([1, 2, 3, 4])
    np.testing.assert_array_equal(answer, correct_answer)


def test_quarter_year_after_2038():
    q = Quarter()
    array = pd.Series(
        [
            pd.to_datetime("2018-01-01"),
            pd.to_datetime("2050-04-01"),
            pd.to_datetime("2174-07-01"),
            pd.to_datetime("2018-10-01"),
        ],
    )
    answer = q(array)
    correct_answer = pd.Series([1, 2, 3, 4])
    np.testing.assert_array_equal(answer, correct_answer)


def test_quarter():
    q = Quarter()
    dates = [datetime(2019, 12, 1), datetime(2019, 1, 3), datetime(2020, 2, 1)]
    quarter = q(dates)
    correct_quarters = [4, 1, 1]
    np.testing.assert_array_equal(quarter, correct_quarters)


def test_week_no_deprecation_message():
    dates = [
        datetime(2019, 1, 3),
        datetime(2019, 6, 17, 11, 10, 50),
        datetime(2019, 11, 30, 19, 45, 15),
    ]
    with warnings.catch_warnings():
        # Escalate every warning to an error so that any deprecation warning fails the test.
        warnings.simplefilter("error")
        week = Week()
        week(dates).tolist()


def test_url_to_domain_urls():
    url_to_domain = URLToDomain()
    urls = pd.Series(
        [
            "https://play.google.com/store/apps/details?id=com.skgames.trafficracer%22",
            "http://mplay.google.co.in/sadfask/asdkfals?dk=10",
            "http://lplay.google.co.in/sadfask/asdkfals?dk=10",
            "http://play.google.co.in/sadfask/asdkfals?dk=10",
            "http://tplay.google.co.in/sadfask/asdkfals?dk=10",
            "http://www.google.co.in/sadfask/asdkfals?dk=10",
            "www.google.co.in/sadfask/asdkfals?dk=10",
            "http://user:pass@google.com/?a=b#asdd",
            "https://www.compzets.com?asd=10",
            "www.compzets.com?asd=10",
            "facebook.com",
            "https://www.compzets.net?asd=10",
            "http://www.featuretools.org",
        ],
    )
    correct_urls = [
        "play.google.com",
        "mplay.google.co.in",
        "lplay.google.co.in",
        "play.google.co.in",
        "tplay.google.co.in",
        "google.co.in",
        "google.co.in",
        "google.com",
        "compzets.com",
        "compzets.com",
        "facebook.com",
        "compzets.net",
        "featuretools.org",
    ]
    np.testing.assert_array_equal(url_to_domain(urls), correct_urls)


def test_url_to_domain_long_url():
    url_to_domain = URLToDomain()
    urls = pd.Series(
        [
            "http://chart.apis.google.com/chart?chs=500x500&chma=0,0,100, \
                        100&cht=p&chco=FF0000%2CFFFF00%7CFF8000%2C00FF00%7C00FF00%2C0 \
                        000FF&chd=t%3A122%2C42%2C17%2C10%2C8%2C7%2C7%2C7%2C7%2C6%2C6% \
                        2C6%2C6%2C5%2C5&chl=122%7C42%7C17%7C10%7C8%7C7%7C7%7C7%7C7%7C \
                        6%7C6%7C6%7C6%7C5%7C5&chdl=android%7Cjava%7Cstack-trace%7Cbro \
                        adcastreceiver%7Candroid-ndk%7Cuser-agent%7Candroid-webview%7 \
                        Cwebview%7Cbackground%7Cmultithreading%7Candroid-source%7Csms \
                        %7Cadb%7Csollections%7Cactivity|Chart",
        ],
    )
    correct_urls = ["chart.apis.google.com"]
    results = url_to_domain(urls)
    np.testing.assert_array_equal(results, correct_urls)


def test_url_to_domain_nan():
    url_to_domain = URLToDomain()
    urls = pd.Series(["www.featuretools.com", np.nan], dtype="object")
    correct_urls = pd.Series(["featuretools.com", np.nan], dtype="object")
    results = url_to_domain(urls)
    pd.testing.assert_series_equal(results, correct_urls)
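

# A standard-library sketch of the behavior the URLToDomain tests above expect:
# the host is kept, while credentials, path, query, and a leading "www." are
# dropped. This is an illustration built on urllib.parse, not the primitive's
# actual implementation; `url_to_domain_sketch` is a hypothetical name.

```python
from urllib.parse import urlparse


def url_to_domain_sketch(url):
    """Hypothetical helper: host without credentials or a leading 'www.', else None."""
    if not isinstance(url, str) or not url.strip():
        return None
    candidate = url.strip()
    # urlparse only recognizes the host when the URL has a scheme or starts with "//".
    if "://" not in candidate:
        candidate = "//" + candidate
    host = urlparse(candidate).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host
```

# Missing values are returned as None here for simplicity, whereas the tests
# above expect NaN in the resulting Series.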


def test_url_to_protocol_urls():
    url_to_protocol = URLToProtocol()
    urls = pd.Series(
        [
            "https://play.google.com/store/apps/details?id=com.skgames.trafficracer%22",
            "http://mplay.google.co.in/sadfask/asdkfals?dk=10",
            "http://lplay.google.co.in/sadfask/asdkfals?dk=10",
            "www.google.co.in/sadfask/asdkfals?dk=10",
            "http://user:pass@google.com/?a=b#asdd",
            "https://www.compzets.com?asd=10",
            "www.compzets.com?asd=10",
            "facebook.com",
            "https://www.compzets.net?asd=10",
            "http://www.featuretools.org",
            "https://featuretools.com",
        ],
    )
    correct_urls = pd.Series(
        [
            "https",
            "http",
            "http",
            np.nan,
            "http",
            "https",
            np.nan,
            np.nan,
            "https",
            "http",
            "https",
        ],
    )
    results = url_to_protocol(urls)
    pd.testing.assert_series_equal(results, correct_urls)


def test_url_to_protocol_long_url():
    url_to_protocol = URLToProtocol()
    urls = pd.Series(
        [
            "http://chart.apis.google.com/chart?chs=500x500&chma=0,0,100, \
                        100&cht=p&chco=FF0000%2CFFFF00%7CFF8000%2C00FF00%7C00FF00%2C0 \
                        000FF&chd=t%3A122%2C42%2C17%2C10%2C8%2C7%2C7%2C7%2C7%2C6%2C6% \
                        2C6%2C6%2C5%2C5&chl=122%7C42%7C17%7C10%7C8%7C7%7C7%7C7%7C7%7C \
                        6%7C6%7C6%7C6%7C5%7C5&chdl=android%7Cjava%7Cstack-trace%7Cbro \
                        adcastreceiver%7Candroid-ndk%7Cuser-agent%7Candroid-webview%7 \
                        Cwebview%7Cbackground%7Cmultithreading%7Candroid-source%7Csms \
                        %7Cadb%7Csollections%7Cactivity|Chart",
        ],
    )
    correct_urls = ["http"]
    results = url_to_protocol(urls)
    np.testing.assert_array_equal(results, correct_urls)


def test_url_to_protocol_nan():
    url_to_protocol = URLToProtocol()
    urls = pd.Series(["www.featuretools.com", np.nan, ""], dtype="object")
    correct_urls = pd.Series([np.nan, np.nan, np.nan], dtype="object")
    results = url_to_protocol(urls)
    pd.testing.assert_series_equal(results, correct_urls)


def test_url_to_tld_urls():
    url_to_tld = URLToTLD()
    urls = pd.Series(
        [
            "https://play.google.com/store/apps/details?id=com.skgames.trafficracer%22",
            "http://mplay.google.co.in/sadfask/asdkfals?dk=10",
            "http://lplay.google.co.in/sadfask/asdkfals?dk=10",
            "http://play.google.co.in/sadfask/asdkfals?dk=10",
            "http://tplay.google.co.in/sadfask/asdkfals?dk=10",
            "http://www.google.co.in/sadfask/asdkfals?dk=10",
            "www.google.co.in/sadfask/asdkfals?dk=10",
            "http://user:pass@google.com/?a=b#asdd",
            "https://www.compzets.dev?asd=10",
            "www.compzets.com?asd=10",
            "https://www.compzets.net?asd=10",
            "http://www.featuretools.org",
            "featuretools.org",
        ],
    )
    correct_urls = [
        "com",
        "in",
        "in",
        "in",
        "in",
        "in",
        "in",
        "com",
        "dev",
        "com",
        "net",
        "org",
        "org",
    ]
    np.testing.assert_array_equal(url_to_tld(urls), correct_urls)


def test_url_to_tld_long_url():
    url_to_tld = URLToTLD()
    urls = pd.Series(
        [
            "http://chart.apis.google.com/chart?chs=500x500&chma=0,0,100, \
                        100&cht=p&chco=FF0000%2CFFFF00%7CFF8000%2C00FF00%7C00FF00%2C0 \
                        000FF&chd=t%3A122%2C42%2C17%2C10%2C8%2C7%2C7%2C7%2C7%2C6%2C6% \
                        2C6%2C6%2C5%2C5&chl=122%7C42%7C17%7C10%7C8%7C7%7C7%7C7%7C7%7C \
                        6%7C6%7C6%7C6%7C5%7C5&chdl=android%7Cjava%7Cstack-trace%7Cbro \
                        adcastreceiver%7Candroid-ndk%7Cuser-agent%7Candroid-webview%7 \
                        Cwebview%7Cbackground%7Cmultithreading%7Candroid-source%7Csms \
                        %7Cadb%7Csollections%7Cactivity|Chart",
        ],
    )
    correct_urls = ["com"]
    np.testing.assert_array_equal(url_to_tld(urls), correct_urls)


def test_url_to_tld_nan():
    url_to_tld = URLToTLD()
    urls = pd.Series(
        ["www.featuretools.com", np.nan, "featuretools", ""],
        dtype="object",
    )
    correct_urls = pd.Series(["com", np.nan, np.nan, np.nan], dtype="object")
    results = url_to_tld(urls)
    pd.testing.assert_series_equal(results, correct_urls, check_names=False)


def test_is_free_email_domain_valid_addresses():
    is_free_email_domain = IsFreeEmailDomain()
    array = pd.Series(
        [
            "test@hotmail.com",
            "name@featuretools.com",
            "nobody@yahoo.com",
            "free@gmail.com",
        ],
    )
    answers = pd.Series(is_free_email_domain(array))
    correct_answers = pd.Series([True, False, True, True])
    pd.testing.assert_series_equal(answers, correct_answers)


def test_is_free_email_domain_valid_addresses_whitespace():
    is_free_email_domain = IsFreeEmailDomain()
    array = pd.Series(
        [
            " test@hotmail.com",
            " name@featuretools.com",
            "nobody@yahoo.com ",
            " free@gmail.com ",
        ],
    )
    answers = pd.Series(is_free_email_domain(array))
    correct_answers = pd.Series([True, False, True, True])
    pd.testing.assert_series_equal(answers, correct_answers)


def test_is_free_email_domain_nan():
    is_free_email_domain = IsFreeEmailDomain()
    array = pd.Series([np.nan, "name@featuretools.com", "nobody@yahoo.com"])
    answers = pd.Series(is_free_email_domain(array))
    correct_answers = pd.Series([np.nan, False, True])
    pd.testing.assert_series_equal(answers, correct_answers)


def test_is_free_email_domain_empty_string():
    is_free_email_domain = IsFreeEmailDomain()
    array = pd.Series(["", "name@featuretools.com", "nobody@yahoo.com"])
    answers = pd.Series(is_free_email_domain(array))
    correct_answers = pd.Series([np.nan, False, True])
    pd.testing.assert_series_equal(answers, correct_answers)


def test_is_free_email_domain_empty_series():
    is_free_email_domain = IsFreeEmailDomain()
    array = pd.Series([], dtype="category")
    answers = pd.Series(is_free_email_domain(array))
    correct_answers = pd.Series([], dtype="category")
    pd.testing.assert_series_equal(answers, correct_answers)


def test_is_free_email_domain_invalid_email():
    is_free_email_domain = IsFreeEmailDomain()
    array = pd.Series(
        [
            np.nan,
            "this is not an email address",
            "name@featuretools.com",
            "nobody@yahoo.com",
            1234,
            1.23,
            True,
        ],
    )
    answers = pd.Series(is_free_email_domain(array))
    correct_answers = pd.Series([np.nan, np.nan, False, True, np.nan, np.nan, np.nan])
    pd.testing.assert_series_equal(answers, correct_answers)


def test_is_free_email_domain_all_nan():
    is_free_email_domain = IsFreeEmailDomain()
    array = pd.Series([np.nan, np.nan])
    answers = pd.Series(is_free_email_domain(array))
    correct_answers = pd.Series([np.nan, np.nan], dtype=object)
    pd.testing.assert_series_equal(answers, correct_answers)


def test_email_address_to_domain_valid_addresses():
    email_address_to_domain = EmailAddressToDomain()
    array = pd.Series(
        [
            "test@hotmail.com",
            "name@featuretools.com",
            "nobody@yahoo.com",
            "free@gmail.com",
        ],
    )
    answers = pd.Series(email_address_to_domain(array))
    correct_answers = pd.Series(
        ["hotmail.com", "featuretools.com", "yahoo.com", "gmail.com"],
    )
    pd.testing.assert_series_equal(answers, correct_answers)


def test_email_address_to_domain_valid_addresses_whitespace():
    email_address_to_domain = EmailAddressToDomain()
    array = pd.Series(
        [
            " test@hotmail.com",
            " name@featuretools.com",
            "nobody@yahoo.com ",
            " free@gmail.com ",
        ],
    )
    answers = pd.Series(email_address_to_domain(array))
    correct_answers = pd.Series(
        ["hotmail.com", "featuretools.com", "yahoo.com", "gmail.com"],
    )
    pd.testing.assert_series_equal(answers, correct_answers)


def test_email_address_to_domain_nan():
    email_address_to_domain = EmailAddressToDomain()
    array = pd.Series([np.nan, "name@featuretools.com", "nobody@yahoo.com"])
    answers = pd.Series(email_address_to_domain(array))
    correct_answers = pd.Series([np.nan, "featuretools.com", "yahoo.com"])
    pd.testing.assert_series_equal(answers, correct_answers)


def test_email_address_to_domain_empty_string():
    email_address_to_domain = EmailAddressToDomain()
    array = pd.Series(["", "name@featuretools.com", "nobody@yahoo.com"])
    answers = pd.Series(email_address_to_domain(array))
    correct_answers = pd.Series([np.nan, "featuretools.com", "yahoo.com"])
    pd.testing.assert_series_equal(answers, correct_answers)


def test_email_address_to_domain_empty_series():
    email_address_to_domain = EmailAddressToDomain()
    array = pd.Series([], dtype="category")
    answers = pd.Series(email_address_to_domain(array))
    correct_answers = pd.Series([], dtype="category")
    pd.testing.assert_series_equal(answers, correct_answers)


def test_email_address_to_domain_invalid_email():
    email_address_to_domain = EmailAddressToDomain()
    array = pd.Series(
        [
            np.nan,
            "this is not an email address",
            "name@featuretools.com",
            "nobody@yahoo.com",
            1234,
            1.23,
            True,
        ],
    )
    answers = pd.Series(email_address_to_domain(array))
    correct_answers = pd.Series(
        [np.nan, np.nan, "featuretools.com", "yahoo.com", np.nan, np.nan, np.nan],
    )
    pd.testing.assert_series_equal(answers, correct_answers)


def test_email_address_to_domain_all_nan():
    email_address_to_domain = EmailAddressToDomain()
    array = pd.Series([np.nan, np.nan])
    answers = pd.Series(email_address_to_domain(array))
    correct_answers = pd.Series([np.nan, np.nan], dtype=object)
    pd.testing.assert_series_equal(answers, correct_answers)
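

# A plain-Python sketch of the rule the EmailAddressToDomain tests above imply:
# surrounding whitespace is ignored, the text after "@" is the domain, and
# anything that is not an email-like string has no domain. The helper name
# `email_domain_sketch` is hypothetical, and the primitive itself may validate
# addresses differently.

```python
def email_domain_sketch(address):
    """Hypothetical helper: text after '@' with whitespace trimmed, else None."""
    if not isinstance(address, str):
        return None
    local, sep, domain = address.strip().rpartition("@")
    # Require a separator with non-empty text on both sides.
    if not sep or not local or not domain:
        return None
    return domain
```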


def test_trans_primitives_can_init_without_params():
    trans_primitives = get_transform_primitives().values()
    for trans_primitive in trans_primitives:
        trans_primitive()


def test_numeric_lag_future_warning():
    warning_text = (
        "NumericLag is deprecated and will be removed in a future version. "
        "Please use the 'Lag' primitive instead."
    )
    with pytest.warns(FutureWarning, match=warning_text):
        NumericLag()


def test_lag_regular():
    primitive_instance = Lag()
    primitive_func = primitive_instance.get_function()

    array = pd.Series([1, 2, 3, 4])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))

    answer = pd.Series(primitive_func(array, time_array))

    correct_answer = pd.Series([np.nan, 1, 2, 3])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_lag_period():
    primitive_instance = Lag(periods=3)
    primitive_func = primitive_instance.get_function()

    array = pd.Series([1, 2, 3, 4])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))

    answer = pd.Series(primitive_func(array, time_array))

    correct_answer = pd.Series([np.nan, np.nan, np.nan, 1])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_lag_negative_period():
    primitive_instance = Lag(periods=-2)
    primitive_func = primitive_instance.get_function()

    array = pd.Series([1, 2, 3, 4])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))

    answer = pd.Series(primitive_func(array, time_array))

    correct_answer = pd.Series([3, 4, np.nan, np.nan])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_lag_starts_with_nan():
    primitive_instance = Lag()
    primitive_func = primitive_instance.get_function()

    array = pd.Series([np.nan, 2, 3, 4])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))

    answer = pd.Series(primitive_func(array, time_array))

    correct_answer = pd.Series([np.nan, np.nan, 2, 3])
    pd.testing.assert_series_equal(answer, correct_answer)


def test_lag_ends_with_nan():
    primitive_instance = Lag()
    primitive_func = primitive_instance.get_function()

    array = pd.Series([1, 2, 3, np.nan])
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))

    answer = pd.Series(primitive_func(array, time_array))

    correct_answer = pd.Series([np.nan, 1, 2, 3])
    pd.testing.assert_series_equal(answer, correct_answer)
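

# A list-based sketch of the shifting that the Lag tests above exercise: a
# positive `periods` moves values later in the sequence, a negative one moves
# them earlier, and vacated positions become NaN. `lag_sketch` is a hypothetical
# helper for illustration only; it ignores the time index that Lag also accepts.

```python
import math


def lag_sketch(values, periods=1):
    """Hypothetical helper: shift values by `periods` positions, filling gaps with NaN."""
    n = len(values)
    shifted = [math.nan] * n
    for i, value in enumerate(values):
        target = i + periods
        if 0 <= target < n:
            shifted[target] = value
    return shifted
```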


@pytest.mark.parametrize(
    "input_array,expected_output",
    [
        (
            pd.Series(["hello", "world", "foo", "bar"], dtype="string"),
            pd.Series([np.nan, "hello", "world", "foo"], dtype="string"),
        ),
        (
            pd.Series(["cow", "cow", "pig", "pig"], dtype="category"),
            pd.Series([np.nan, "cow", "cow", "pig"], dtype="category"),
        ),
        (
            # Non-nullable bool cannot hold a missing value, so the lagged result is object dtype.
            pd.Series([True, False, True, False], dtype="bool"),
            pd.Series([np.nan, True, False, True], dtype="object"),
        ),
        (
            pd.Series([True, False, True, False], dtype="boolean"),
            pd.Series([np.nan, True, False, True], dtype="boolean"),
        ),
        (
            pd.Series([1.23, 2.45, 3.56, 4.98], dtype="float"),
            pd.Series([np.nan, 1.23, 2.45, 3.56], dtype="float"),
        ),
        (
            pd.Series([1, 2, 3, 4], dtype="Int64"),
            pd.Series([np.nan, 1, 2, 3], dtype="Int64"),
        ),
        (
            # Non-nullable int64 is upcast to float64 so NaN can represent the missing first value.
            pd.Series([1, 2, 3, 4], dtype="int64"),
            pd.Series([np.nan, 1, 2, 3], dtype="float64"),
        ),
    ],
)
def test_lag_with_different_dtypes(input_array, expected_output):
    primitive_instance = Lag()
    primitive_func = primitive_instance.get_function()
    time_array = pd.Series(pd.date_range(start="2020-01-01", periods=4, freq="D"))
    answer = pd.Series(primitive_func(input_array, time_array))
    pd.testing.assert_series_equal(answer, expected_output)


def test_date_to_time_zone_primitive():
    primitive_func = DateToTimeZone().get_function()
    x = pd.Series(
        [
            datetime(2010, 1, 1, tzinfo=timezone("America/Los_Angeles")),
            datetime(2010, 1, 10, tzinfo=timezone("Singapore")),
            datetime(2020, 1, 1, tzinfo=timezone("UTC")),
            datetime(2010, 1, 1, tzinfo=timezone("Europe/London")),
        ],
    )
    answer = pd.Series(["America/Los_Angeles", "Singapore", "UTC", "Europe/London"])
    pd.testing.assert_series_equal(primitive_func(x), answer)


def test_date_to_time_zone_datetime64():
    primitive_func = DateToTimeZone().get_function()
    x = pd.Series(
        [
            datetime(2010, 1, 1),
            datetime(2010, 1, 10),
            datetime(2020, 1, 1),
        ],
    ).astype("datetime64[ns]")
    x = x.dt.tz_localize("America/Los_Angeles")
    answer = pd.Series(["America/Los_Angeles"] * 3)
    pd.testing.assert_series_equal(primitive_func(x), answer)


def test_date_to_time_zone_naive_dates():
    primitive_func = DateToTimeZone().get_function()
    x = pd.Series(
        [
            datetime(2010, 1, 1, tzinfo=timezone("America/Los_Angeles")),
            datetime(2010, 1, 1),
            datetime(2010, 1, 2),
        ],
    )
    answer = pd.Series(["America/Los_Angeles", np.nan, np.nan])
    pd.testing.assert_series_equal(primitive_func(x), answer)


def test_date_to_time_zone_nan():
    primitive_func = DateToTimeZone().get_function()
    x = pd.Series(
        [
            datetime(2010, 1, 1, tzinfo=timezone("America/Los_Angeles")),
            pd.NaT,
            np.nan,
        ],
    )
    answer = pd.Series(["America/Los_Angeles", np.nan, np.nan])
    pd.testing.assert_series_equal(primitive_func(x), answer)


def test_rate_of_change_primitive_regular_interval():
    rate_of_change = RateOfChange()
    times = pd.date_range(start="2019-01-01", freq="2s", periods=5)
    values = [0, 30, 180, -90, 0]
    expected = pd.Series([np.nan, 15, 75, -135, 45])
    actual = rate_of_change(values, times)
    pd.testing.assert_series_equal(actual, expected)


def test_rate_of_change_primitive_uneven_interval():
    rate_of_change = RateOfChange()
    times = pd.to_datetime(
        [
            "2019-01-01 00:00:00",
            "2019-01-01 00:00:01",
            "2019-01-01 00:00:03",
            "2019-01-01 00:00:07",
            "2019-01-01 00:00:08",
        ],
    )
    values = [0, 30, 180, -90, 0]
    expected = pd.Series([np.nan, 30, 75, -67.5, 90])
    actual = rate_of_change(values, times)
    pd.testing.assert_series_equal(actual, expected)


def test_rate_of_change_primitive_with_nan():
    rate_of_change = RateOfChange()
    times = pd.date_range(start="2019-01-01", freq="2s", periods=5)
    values = [0, 30, np.nan, -90, 0]
    expected = pd.Series([np.nan, 15, np.nan, np.nan, 45])
    actual = rate_of_change(values, times)
    pd.testing.assert_series_equal(actual, expected)
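

# A standard-library sketch of how the expected RateOfChange values above can be
# derived: each entry is the change in value divided by the elapsed seconds since
# the previous observation, and the first entry has no predecessor. The helper
# name `rate_of_change_sketch` is hypothetical.

```python
import math
from datetime import datetime


def rate_of_change_sketch(values, times):
    """Hypothetical helper: per-second change between consecutive observations."""
    rates = []
    for i in range(len(values)):
        if i == 0:
            rates.append(math.nan)  # no previous observation to compare against
            continue
        seconds = (times[i] - times[i - 1]).total_seconds()
        # NaN in either value propagates through the subtraction.
        rates.append((values[i] - values[i - 1]) / seconds)
    return rates


sketch_times = [datetime(2019, 1, 1, 0, 0, s) for s in (0, 1, 3, 7, 8)]
sketch_rates = rate_of_change_sketch([0, 30, 180, -90, 0], sketch_times)
# After the leading NaN: 30/1, 150/2, -270/4, 90/1
```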


class TestFileExtension(PrimitiveTestBase):
    primitive = FileExtension

    def test_filepaths(self):
        primitive_func = FileExtension().get_function()
        array = pd.Series(
            [
                "doc.txt",
                "~/documents/data.json",
                "data.JSON",
                "C:\\Projects\\apilibrary\\apilibrary.sln",
            ],
            dtype="string",
        )
        answer = pd.Series([".txt", ".json", ".json", ".sln"], dtype="string")
        pd.testing.assert_series_equal(primitive_func(array), answer)

    def test_invalid(self):
        primitive_func = FileExtension().get_function()
        array = pd.Series(["doc.txt", "~/documents/data", np.nan], dtype="string")
        answer = pd.Series([".txt", np.nan, np.nan], dtype="string")
        pd.testing.assert_series_equal(primitive_func(array), answer)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(
            es,
            aggregation,
            transform,
            self.primitive,
            target_dataframe_name="sessions",
        )


class TestIsFirstWeekOfMonth(PrimitiveTestBase):
    primitive = IsFirstWeekOfMonth

    def test_valid_dates(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("03/01/2019"),
                pd.to_datetime("03/03/2019"),
                pd.to_datetime("03/31/2019"),
                pd.to_datetime("03/30/2019"),
            ],
        )
        answers = primitive_func(array).tolist()
        correct_answers = [True, False, False, False]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_leap_year(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("03/01/2019"),
                pd.to_datetime("02/29/2016"),
                pd.to_datetime("03/31/2019"),
                pd.to_datetime("03/30/2019"),
            ],
        )
        answers = primitive_func(array).tolist()
        correct_answers = [True, False, False, False]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_year_before_1970(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("06/01/1965"),
                pd.to_datetime("03/02/2019"),
                pd.to_datetime("03/31/2019"),
                pd.to_datetime("03/30/2019"),
            ],
        )
        answers = primitive_func(array).tolist()
        correct_answers = [True, True, False, False]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_year_after_2038(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("12/31/2040"),
                pd.to_datetime("01/01/2040"),
                pd.to_datetime("03/31/2019"),
                pd.to_datetime("03/30/2019"),
            ],
        )
        answers = primitive_func(array).tolist()
        correct_answers = [False, True, False, False]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_nan_input(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("03/01/2019"),
                np.nan,
                np.datetime64("NaT"),
                pd.to_datetime("03/30/2019"),
            ],
        )
        answers = primitive_func(array).tolist()
        correct_answers = [True, np.nan, np.nan, False]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


class TestNthWeekOfMonth(PrimitiveTestBase):
    primitive = NthWeekOfMonth

    def test_valid_dates(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("03/01/2019"),
                pd.to_datetime("03/03/2019"),
                pd.to_datetime("03/31/2019"),
                pd.to_datetime("03/30/2019"),
                pd.to_datetime("09/01/2019"),
            ],
        )
        answers = primitive_func(array)
        correct_answers = [1, 2, 6, 5, 1]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_leap_year(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("03/01/2019"),
                pd.to_datetime("02/29/2016"),
                pd.to_datetime("03/31/2019"),
                pd.to_datetime("03/30/2019"),
            ],
        )
        answers = primitive_func(array)
        correct_answers = [1, 5, 6, 5]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_year_before_1970(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("06/06/1965"),
                pd.to_datetime("03/02/2019"),
                pd.to_datetime("03/31/2019"),
                pd.to_datetime("03/30/2019"),
            ],
        )
        answers = primitive_func(array)
        correct_answers = [2, 1, 6, 5]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_year_after_2038(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("12/31/2040"),
                pd.to_datetime("01/01/2040"),
                pd.to_datetime("03/31/2019"),
                pd.to_datetime("03/30/2019"),
            ],
        )
        answers = primitive_func(array)
        correct_answers = [6, 1, 6, 5]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_nan_input(self):
        primitive_func = self.primitive().get_function()
        array = pd.Series(
            [
                pd.to_datetime("03/01/2019"),
                np.nan,
                np.datetime64("NaT"),
                pd.to_datetime("03/30/2019"),
            ],
        )
        answers = primitive_func(array)
        correct_answers = [1, np.nan, np.nan, 5]
        np.testing.assert_array_equal(answers, correct_answers)

    def test_with_featuretools(self, es):
        transform, aggregation = find_applicable_primitives(self.primitive)
        primitive_instance = self.primitive()
        transform.append(primitive_instance)
        valid_dfs(es, aggregation, transform, self.primitive)


================================================
FILE: featuretools/tests/primitive_tests/utils.py
================================================
from inspect import signature

import pytest

from featuretools import (
    FeatureBase,
    calculate_feature_matrix,
    dfs,
    encode_features,
    list_primitives,
    load_features,
    save_features,
)
from featuretools.primitives.base import AggregationPrimitive, PrimitiveBase
from featuretools.tests.testing_utils import make_ecommerce_entityset

PRIMITIVES = list_primitives()


def get_number_from_offset(offset):
    """Extract the numeric element of a potential offset string.

    Args:
        offset (int, str): If offset is an integer, that value is returned. If offset is a string,
            it's assumed to be an offset string of the format nD where n is a single digit integer.

    Returns:
        int: The offset itself if it is an integer, otherwise the first character of the
            offset string converted to an integer.

    Note: This helper utility should only be used with offset strings that only have one numeric character.
        Only the first character will be returned, so if an offset string 24H is used, it will incorrectly
        return the integer 2. Additionally, any of the offset timespans (H for hourly, D for daily, etc.)
        can be used here; however, care should be taken by the user to remember what that timespan is when
        writing tests, as comparing 7 from 7D to 1 from 1W may not behave as expected.
    """
    if isinstance(offset, str):
        return int(offset[0])
    else:
        return offset


class PrimitiveTestBase:
    primitive = None

    @pytest.fixture()
    def es(self):
        es = make_ecommerce_entityset()
        return es

    def test_name_and_desc(self):
        assert self.primitive.name is not None
        assert self.primitive.__doc__ is not None
        docstring = self.primitive.__doc__
        short_description = docstring.splitlines()[0]
        first_word = short_description.split(" ", 1)[0]
        valid_verbs = [
            "Calculates",
            "Determines",
            "Transforms",
            "Computes",
            "Shifts",
            "Extracts",
            "Applies",
        ]
        assert any(s in first_word for s in valid_verbs)
        assert self.primitive.input_types is not None

    def test_name_in_primitive_list(self):
        assert PRIMITIVES.name.eq(self.primitive.name).any()

    def test_arg_init(self):
        primitive_ = self.primitive()
        # determine the optional arguments in the __init__
        init_params = signature(self.primitive.__init__)
        for name, parameter in init_params.parameters.items():
            if parameter.default is not parameter.empty:
                assert hasattr(primitive_, name)

    def test_serialize(self, es, target_dataframe_name="log"):
        check_serialize(
            primitive=self.primitive,
            es=es,
            target_dataframe_name=target_dataframe_name,
        )


def check_serialize(primitive, es, target_dataframe_name="log"):
    trans_primitives = []
    agg_primitives = []
    if issubclass(primitive, AggregationPrimitive):
        agg_primitives = [primitive]
    else:
        trans_primitives = [primitive]
    features = dfs(
        entityset=es,
        target_dataframe_name=target_dataframe_name,
        agg_primitives=agg_primitives,
        trans_primitives=trans_primitives,
        max_features=-1,
        max_depth=3,
        features_only=True,
        return_types="all",
    )

    feat_to_serialize = None
    for feature in features:
        if feature.primitive.__class__ == primitive:
            feat_to_serialize = feature
            break
        for base_feature in feature.get_dependencies(deep=True):
            if base_feature.primitive.__class__ == primitive:
                feat_to_serialize = base_feature
                break
    assert feat_to_serialize is not None

    # Skip calculating feature matrix for long running primitives
    skip_primitives = ["elmo"]

    if primitive.name not in skip_primitives:
        df1 = calculate_feature_matrix([feat_to_serialize], entityset=es)

    new_feat = load_features(save_features([feat_to_serialize]))[0]
    assert isinstance(new_feat, FeatureBase)

    if primitive.name not in skip_primitives:
        df2 = calculate_feature_matrix([new_feat], entityset=es)
        assert df1.equals(df2)


def find_applicable_primitives(primitive):
    from featuretools.primitives.utils import (
        get_aggregation_primitives,
        get_transform_primitives,
    )

    all_transform_primitives = list(get_transform_primitives().values())
    all_aggregation_primitives = list(get_aggregation_primitives().values())
    applicable_transforms = find_stackable_primitives(
        all_transform_primitives,
        primitive,
    )
    applicable_aggregations = find_stackable_primitives(
        all_aggregation_primitives,
        primitive,
    )
    return applicable_transforms, applicable_aggregations


def find_stackable_primitives(all_primitives, primitive):
    applicable_primitives = []
    for x in all_primitives:
        if x.input_types == [primitive.return_type]:
            applicable_primitives.append(x)
    return applicable_primitives


def valid_dfs(
    es,
    aggregations,
    transforms,
    feature_substrings,
    target_dataframe_name="log",
    multi_output=False,
    max_depth=3,
    max_features=-1,
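    instance_ids=None,
):
    # Avoid a mutable default argument; fall back to the first four instances
    if instance_ids is None:
        instance_ids = [0, 1, 2, 3]

    if not isinstance(feature_substrings, list):
        feature_substrings = [feature_substrings]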

    # issubclass raises TypeError on non-class values such as plain substrings
    if any(
        isinstance(x, type) and issubclass(x, PrimitiveBase)
        for x in feature_substrings
    ):
        feature_substrings = [x.name.upper() for x in feature_substrings]

    features = dfs(
        entityset=es,
        target_dataframe_name=target_dataframe_name,
        agg_primitives=aggregations,
        trans_primitives=transforms,
        max_features=max_features,
        max_depth=max_depth,
        features_only=True,
    )
    applicable_features = []
    for feat in features:
        applicable_features += [
            feat for x in feature_substrings if x in feat.get_name()
        ]
    if len(applicable_features) == 0:
        raise ValueError(
            "No feature names with %s, verify the name attribute "
            "is defined and/or generate_name() is defined to "
            "return %s" % (feature_substrings, feature_substrings),
        )
    df = calculate_feature_matrix(
        entityset=es,
        features=applicable_features,
        instance_ids=instance_ids,
        n_jobs=1,
    )

    encode_features(df, applicable_features)

    # TODO: check the multi_output shape by checking
    # feature.number_output_features for each feature
    # and comparing it with the matrix shape
    if not multi_output:
        assert len(applicable_features) == df.shape[1]
    return


================================================
FILE: featuretools/tests/profiling/__init__.py
================================================


================================================
FILE: featuretools/tests/profiling/dfs_profile.py
================================================
"""
dfs_profile.py

Helper module to allow profiling of the dfs operations.  At some point we may
want to use pstats to output the results to a log, but I'm anticipating that
LookingGlass will provide the performance data we want.

Notes:
  - output currently goes to the current working directory and is in dfs_profile.stats
  - *.stats is gitignored
  - it uses the demo retail dataset (demo.load_retail) with "customers" as the target dataframe
  - max_depth > 2 is very slow (currently)
  - stats output can be viewed online with https://nejc.saje.info/pstats-viewer.html
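
A possible way to read the stats file locally with pstats, sketched under
assumptions: the helper name summarize_stats, the stand-in workload and the
temporary file path are illustrative only and are not part of this module.

```python
import cProfile
import io
import pstats
import tempfile
from pathlib import Path


def summarize_stats(stats_path, limit=10):
    # Load a file written by Profile.dump_stats and return a text report
    # sorted by cumulative time, limited to the top `limit` entries.
    buffer = io.StringIO()
    stats = pstats.Stats(str(stats_path), stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


# Small stand-in workload so the sketch runs without featuretools installed
profiler = cProfile.Profile(builtins=False)
profiler.enable()
sorted(range(1000), key=lambda value: -value)
profiler.disable()

# Hypothetical output location; this module writes to the working directory
stats_file = Path(tempfile.mkdtemp()) / "example.stats"
profiler.dump_stats(stats_file)
report = summarize_stats(stats_file)
```

Calling summarize_stats("dfs_profile.stats") after running this module should
give a comparable report for the dfs call.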
"""

import cProfile
from pathlib import Path

import featuretools as ft
import featuretools.demo as demo
from featuretools.synthesis.dfs import dfs

es = demo.load_retail()

all_aggs = ft.primitives.get_aggregation_primitives()
all_trans = ft.primitives.get_transform_primitives()

profiler = cProfile.Profile(builtins=False)
profiler.enable()
feature_defs = dfs(
    entityset=es,
    target_dataframe_name="customers",
    trans_primitives=all_trans,
    agg_primitives=all_aggs,
    max_depth=2,
    features_only=True,
)
profiler.disable()
profiler.dump_stats(Path.cwd() / "dfs_profile.stats")


================================================
FILE: featuretools/tests/requirement_files/latest_requirements.txt
================================================
cloudpickle==3.0.0
dask==2024.6.2
dask-expr==1.1.6
distributed==2024.6.2
holidays==0.51
numpy==1.26.4
pandas==2.2.2
psutil==6.0.0
scipy==1.13.1
tqdm==4.66.4
woodwork==0.31.0


================================================
FILE: featuretools/tests/requirement_files/minimum_core_requirements.txt
================================================
cloudpickle==1.5.0
holidays==0.17
numpy==1.25.0
packaging==20.0
pandas==2.0.0
psutil==5.7.0
scipy==1.10.0
tqdm==4.66.3
woodwork==0.28.0


================================================
FILE: featuretools/tests/requirement_files/minimum_dask_requirements.txt
================================================
cloudpickle==1.5.0
dask[dataframe]==2023.2.0
distributed==2023.2.0
holidays==0.17
numpy==1.25.0
packaging==20.0
pandas==2.0.0
psutil==5.7.0
scipy==1.10.0
tqdm==4.66.3
woodwork==0.28.0


================================================
FILE: featuretools/tests/requirement_files/minimum_test_requirements.txt
================================================
boto3==1.34.32
cloudpickle==1.5.0
composeml==0.8.0
graphviz==0.8.4
holidays==0.17
moto[all]==5.0.0
numpy==1.25.0
packaging==20.0
pandas==2.0.0
pip==23.3.0
psutil==5.7.0
pyarrow==14.0.1
pympler==0.8
pytest-cov==3.0.0
pytest-timeout==2.1.0
pytest-xdist==2.5.0
pytest==7.1.2
scipy==1.10.0
smart-open==5.0.0
tqdm==4.66.3
urllib3==1.26.18
woodwork==0.28.0


================================================
FILE: featuretools/tests/selection/__init__.py
================================================


================================================
FILE: featuretools/tests/selection/test_selection.py
================================================
import numpy as np
import pandas as pd
import pytest
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, BooleanNullable, NaturalLanguage

from featuretools import EntitySet, Feature, dfs
from featuretools.selection import (
    remove_highly_correlated_features,
    remove_highly_null_features,
    remove_low_information_features,
    remove_single_value_features,
)
from featuretools.tests.testing_utils import make_ecommerce_entityset


@pytest.fixture
def feature_matrix():
    feature_matrix = pd.DataFrame(
        {
            "test": [0, 1, 2],
            "no_null": [np.nan, 0, 0],
            "some_null": [np.nan, 0, 0],
            "all_null": [np.nan, np.nan, np.nan],
            "many_value": [1, 2, 3],
            "dup_value": [1, 1, 2],
            "one_value": [1, 1, 1],
        },
    )
    return feature_matrix


@pytest.fixture
def test_es(es, feature_matrix):
    es.add_dataframe(dataframe_name="test", dataframe=feature_matrix, index="test")
    return es


def test_remove_low_information_feature_names(feature_matrix):
    feature_matrix = remove_low_information_features(feature_matrix)
    assert feature_matrix.shape == (3, 5)
    assert "one_value" not in feature_matrix.columns
    assert "all_null" not in feature_matrix.columns


def test_remove_low_information_features(test_es, feature_matrix):
    features = [Feature(test_es["test"].ww[col]) for col in test_es["test"].columns]
    feature_matrix, features = remove_low_information_features(feature_matrix, features)
    assert feature_matrix.shape == (3, 5)
    assert len(features) == 5
    for f in features:
        assert f.get_name() in feature_matrix.columns
    assert "one_value" not in feature_matrix.columns
    assert "all_null" not in feature_matrix.columns


def test_remove_highly_null_features():
    nulls_df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3],
            "half_nulls": [None, None, 88, 99],
            "all_nulls": [None, None, None, None],
            "quarter": ["a", "b", None, "c"],
            "vals": [True, True, False, False],
        },
    )

    es = EntitySet("data", {"nulls": (nulls_df, "id")})
    es["nulls"].ww.set_types(
        logical_types={"all_nulls": "categorical", "quarter": "categorical"},
    )
    fm, features = dfs(
        entityset=es,
        target_dataframe_name="nulls",
        trans_primitives=["is_null"],
        max_depth=1,
    )

    with pytest.raises(
        ValueError,
        match="pct_null_threshold must be a float between 0 and 1, inclusive.",
    ):
        remove_highly_null_features(fm, pct_null_threshold=1.1)

    with pytest.raises(
        ValueError,
        match="pct_null_threshold must be a float between 0 and 1, inclusive.",
    ):
        remove_highly_null_features(fm, pct_null_threshold=-0.1)

    no_thresh = remove_highly_null_features(fm)
    no_thresh_cols = set(no_thresh.columns)
    diff = set(fm.columns) - no_thresh_cols
    assert len(diff) == 1
    assert "all_nulls" not in no_thresh_cols

    half = remove_highly_null_features(fm, pct_null_threshold=0.5)
    half_cols = set(half.columns)
    diff = set(fm.columns) - half_cols
    assert len(diff) == 2
    assert "all_nulls" not in half_cols
    assert "half_nulls" not in half_cols

    no_tolerance = remove_highly_null_features(fm, pct_null_threshold=0)
    no_tolerance_cols = set(no_tolerance.columns)
    diff = set(fm.columns) - no_tolerance_cols
    assert len(diff) == 3
    assert "all_nulls" not in no_tolerance_cols
    assert "half_nulls" not in no_tolerance_cols
    assert "quarter" not in no_tolerance_cols

    (
        with_features_param,
        with_features_param_features,
    ) = remove_highly_null_features(fm, features)
    assert len(with_features_param_features) == len(no_thresh.columns)
    for i in range(len(with_features_param_features)):
        assert with_features_param_features[i].get_name() == no_thresh.columns[i]
        assert with_features_param.columns[i] == no_thresh.columns[i]


def test_remove_single_value_features():
    same_vals_df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3],
            "all_numeric": [88, 88, 88, 88],
            "with_nan": [1, 1, None, 1],
            "all_nulls": [None, None, None, None],
            "all_categorical": ["a", "a", "a", "a"],
            "all_bools": [True, True, True, True],
            "diff_vals": ["hi", "bye", "bye", "hi"],
        },
    )

    es = EntitySet("data", {"single_vals": (same_vals_df, "id")})
    es["single_vals"].ww.set_types(
        logical_types={
            "all_nulls": "categorical",
            "all_categorical": "categorical",
            "diff_vals": "categorical",
        },
    )
    fm, features = dfs(
        entityset=es,
        target_dataframe_name="single_vals",
        trans_primitives=["is_null"],
        max_depth=1,
    )

    no_params, no_params_features = remove_single_value_features(fm, features)
    no_params_cols = set(no_params.columns)
    assert len(no_params_features) == 2
    assert "IS_NULL(with_nan)" in no_params_cols
    assert "diff_vals" in no_params_cols

    nan_as_value, nan_as_value_features = remove_single_value_features(
        fm,
        features,
        count_nan_as_value=True,
    )
    nan_cols = set(nan_as_value.columns)
    assert len(nan_as_value_features) == 3
    assert "IS_NULL(with_nan)" in nan_cols
    assert "diff_vals" in nan_cols
    assert "with_nan" in nan_cols

    without_features_param = remove_single_value_features(fm)
    assert len(no_params.columns) == len(without_features_param.columns)
    for i in range(len(no_params.columns)):
        assert no_params.columns[i] == without_features_param.columns[i]
        assert no_params_features[i].get_name() == without_features_param.columns[i]


def test_remove_highly_correlated_features():
    correlated_df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3],
            "diff_ints": [34, 11, 29, 91],
            "words": ["test", "this is a short sentence", "foo bar", "baz"],
            "corr_words": [4, 24, 7, 3],
            "corr_1": [99, 88, 77, 33],
            "corr_2": [99, 88, 77, 33],
        },
    )

    es = EntitySet(
        "data",
        {"correlated": (correlated_df, "id", None, {"words": NaturalLanguage})},
    )
    fm, _ = dfs(
        entityset=es,
        target_dataframe_name="correlated",
        trans_primitives=["num_characters"],
        max_depth=1,
    )

    with pytest.raises(
        ValueError,
        match="pct_corr_threshold must be a float between 0 and 1, inclusive.",
    ):
        remove_highly_correlated_features(fm, pct_corr_threshold=1.1)

    with pytest.raises(
        ValueError,
        match="pct_corr_threshold must be a float between 0 and 1, inclusive.",
    ):
        remove_highly_correlated_features(fm, pct_corr_threshold=-0.1)

    with pytest.raises(
        AssertionError,
        match="feature named not_a_feature is not in feature matrix",
    ):
        remove_highly_correlated_features(fm, features_to_check=["not_a_feature"])

    to_check = remove_highly_correlated_features(
        fm,
        features_to_check=["corr_words", "NUM_CHARACTERS(words)", "diff_ints"],
    )
    to_check_columns = set(to_check.columns)
    assert len(to_check_columns) == 4
    assert "NUM_CHARACTERS(words)" not in to_check_columns
    assert "corr_1" in to_check_columns
    assert "corr_2" in to_check_columns

    to_keep = remove_highly_correlated_features(
        fm,
        features_to_keep=["NUM_CHARACTERS(words)"],
    )
    to_keep_names = set(to_keep.columns)
    assert len(to_keep_names) == 4
    assert "corr_words" in to_keep_names
    assert "NUM_CHARACTERS(words)" in to_keep_names
    assert "corr_2" not in to_keep_names

    new_fm = remove_highly_correlated_features(fm)
    assert len(new_fm.columns) == 3
    assert "corr_2" not in new_fm.columns
    assert "NUM_CHARACTERS(words)" not in new_fm.columns

    diff_threshold = remove_highly_correlated_features(fm, pct_corr_threshold=0.8)
    diff_threshold_cols = diff_threshold.columns
    assert len(diff_threshold_cols) == 2
    assert "corr_words" in diff_threshold_cols
    assert "diff_ints" in diff_threshold_cols


def test_remove_highly_correlated_features_init_woodwork():
    correlated_df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3],
            "diff_ints": [34, 11, 29, 91],
            "words": ["test", "this is a short sentence", "foo bar", "baz"],
            "corr_words": [4, 24, 7, 3],
            "corr_1": [99, 88, 77, 33],
            "corr_2": [99, 88, 77, 33],
        },
    )

    es = EntitySet(
        "data",
        {"correlated": (correlated_df, "id", None, {"words": NaturalLanguage})},
    )
    fm, _ = dfs(
        entityset=es,
        target_dataframe_name="correlated",
        trans_primitives=["num_characters"],
        max_depth=1,
    )

    no_ww_fm = fm.copy()
    ww_fm = fm.copy()
    ww_fm.ww.init()

    new_no_ww_fm = remove_highly_correlated_features(no_ww_fm)
    new_ww_fm = remove_highly_correlated_features(ww_fm)

    pd.testing.assert_frame_equal(new_no_ww_fm, new_ww_fm)


def test_multi_output_selection():
    df1 = pd.DataFrame({"id": [0, 1, 2, 3]})

    df2 = pd.DataFrame(
        {
            "index": [0, 1, 2, 3],
            "first_id": [0, 1, 1, 3],
            "all_nulls": [None, None, None, None],
            "quarter": ["a", "b", None, "c"],
        },
    )

    dataframes = {
        "first": (df1, "id"),
        "second": (df2, "index"),
    }

    relationships = [("first", "id", "second", "first_id")]
    es = EntitySet("data", dataframes, relationships=relationships)
    es["second"].ww.set_types(
        logical_types={"all_nulls": "categorical", "quarter": "categorical"},
    )

    fm, features = dfs(
        entityset=es,
        target_dataframe_name="first",
        trans_primitives=[],
        agg_primitives=["n_most_common"],
        max_depth=1,
    )

    multi_output, multi_output_features = remove_single_value_features(fm, features)
    assert list(multi_output.columns) == ["N_MOST_COMMON(second.quarter)[0]"]
    assert len(multi_output_features) == 1
    assert multi_output_features[0].get_name() == multi_output.columns[0]

    es = make_ecommerce_entityset()
    fm, features = dfs(
        entityset=es,
        target_dataframe_name="régions",
        trans_primitives=[],
        agg_primitives=["n_most_common"],
        max_depth=2,
    )

    matrix_with_slices, unsliced_features = remove_highly_null_features(fm, features)
    assert len(matrix_with_slices.columns) == 18
    assert len(unsliced_features) == 14

    matrix_columns = set(matrix_with_slices.columns)
    for f in unsliced_features:
        for f_name in f.get_feature_names():
            assert f_name in matrix_columns


def test_remove_highly_correlated_features_on_boolean_cols():
    correlated_df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3],
            "diff_ints": [34, 11, 29, 91],
            "corr_words": [4, 24, 7, 3],
            "bools": [True, True, False, True],
        },
    )

    es = EntitySet(
        "data",
        {"correlated": (correlated_df, "id", None, {"bools": Boolean})},
    )

    feature_matrix, features = dfs(
        entityset=es,
        target_dataframe_name="correlated",
        trans_primitives=["equal"],
        agg_primitives=[],
        max_depth=1,
        return_types=[
            ColumnSchema(logical_type=BooleanNullable),
            ColumnSchema(logical_type=Boolean),
        ],
    )
    # Confirm both boolean logical types are included so that we know we're checking the correct types
    assert {
        ltype.type_string for ltype in feature_matrix.ww.logical_types.values()
    } == {Boolean.type_string, BooleanNullable.type_string}

    to_keep = remove_highly_correlated_features(
        feature_matrix=feature_matrix,
        features=features,
        pct_corr_threshold=0.3,
    )
    assert len(to_keep[0].columns) < len(feature_matrix.columns)


================================================
FILE: featuretools/tests/synthesis/__init__.py
================================================


================================================
FILE: featuretools/tests/synthesis/test_deep_feature_synthesis.py
================================================
import copy
import re

import pandas as pd
import pytest
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Datetime

from featuretools import EntitySet, Feature, GroupByTransformFeature
from featuretools.entityset.entityset import LTI_COLUMN_NAME
from featuretools.feature_base import (
    AggregationFeature,
    DirectFeature,
    IdentityFeature,
    TransformFeature,
)
from featuretools.feature_base.utils import is_valid_input
from featuretools.primitives import (
    Absolute,
    AddNumeric,
    Count,
    CumCount,
    CumMean,
    CumMin,
    CumSum,
    Day,
    Diff,
    Equal,
    Hour,
    IsIn,
    IsNull,
    Last,
    Mean,
    Mode,
    Month,
    Negate,
    NMostCommon,
    Not,
    NotEqual,
    NumCharacters,
    NumTrue,
    NumUnique,
    RollingCount,
    RollingMax,
    RollingMean,
    RollingMin,
    RollingOutlierCount,
    RollingSTD,
    Sum,
    TimeSincePrevious,
    TransformPrimitive,
    Trend,
    Year,
)
from featuretools.synthesis import DeepFeatureSynthesis
from featuretools.tests.testing_utils import (
    feature_with_name,
    make_ecommerce_entityset,
    number_of_features_with_name_like,
)


def test_makes_agg_features_from_str(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=["sum"],
        trans_primitives=[],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "SUM(log.value)")


def test_makes_agg_features_from_mixed_str(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Count, "sum"],
        trans_primitives=[],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "SUM(log.value)")
    assert feature_with_name(features, "COUNT(log)")


def test_makes_agg_features(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Sum],
        trans_primitives=[],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "SUM(log.value)")


def test_only_makes_supplied_agg_feat(es):
    kwargs = dict(
        target_dataframe_name="customers",
        entityset=es,
        max_depth=3,
    )
    dfs_obj = DeepFeatureSynthesis(agg_primitives=[Sum], **kwargs)

    features = dfs_obj.build_features()

    def find_other_agg_features(features):
        return [
            f
            for f in features
            if (isinstance(f, AggregationFeature) and not isinstance(f.primitive, Sum))
            or len(
                [
                    g
                    for g in f.base_features
                    if isinstance(g, AggregationFeature)
                    and not isinstance(g.primitive, Sum)
                ],
            )
            > 0
        ]

    other_agg_features = find_other_agg_features(features)
    assert len(other_agg_features) == 0


def test_error_for_missing_target_dataframe(es):
    error_text = (
        "Provided target dataframe missing_dataframe does not exist in ecommerce"
    )
    with pytest.raises(KeyError, match=error_text):
        DeepFeatureSynthesis(
            target_dataframe_name="missing_dataframe",
            entityset=es,
            agg_primitives=[Last],
            trans_primitives=[],
            ignore_dataframes=["log"],
        )

    es_without_id = EntitySet()
    error_text = (
        "Provided target dataframe missing_dataframe does not exist in entity set"
    )
    with pytest.raises(KeyError, match=error_text):
        DeepFeatureSynthesis(
            target_dataframe_name="missing_dataframe",
            entityset=es_without_id,
            agg_primitives=[Last],
            trans_primitives=[],
            ignore_dataframes=["log"],
        )


def test_ignores_dataframes(es):
    error_text = "ignore_dataframes must be a list"
    with pytest.raises(TypeError, match=error_text):
        DeepFeatureSynthesis(
            target_dataframe_name="sessions",
            entityset=es,
            agg_primitives=[Sum],
            trans_primitives=[],
            ignore_dataframes="log",
        )

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Sum],
        trans_primitives=[],
        ignore_dataframes=["log"],
    )

    features = dfs_obj.build_features()
    for f in features:
        deps = f.get_dependencies(deep=True)
        dataframes = [d.dataframe_name for d in deps]
        assert "log" not in dataframes


def test_ignores_columns(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Sum],
        trans_primitives=[],
        ignore_columns={"log": ["value"]},
    )
    features = dfs_obj.build_features()
    for f in features:
        deps = f.get_dependencies(deep=True)
        identities = [d for d in deps if isinstance(d, IdentityFeature)]
        columns = [d.column_name for d in identities if d.dataframe_name == "log"]
        assert "value" not in columns


def test_ignore_columns_input_type(es):
    # raw string with escaped brackets so pytest's regex match treats them literally
    error_msg = r"ignore_columns should be dict\[str -> list\]"
    wrong_input_type = {"log": "value"}
    with pytest.raises(TypeError, match=error_msg):
        DeepFeatureSynthesis(
            target_dataframe_name="log",
            entityset=es,
            ignore_columns=wrong_input_type,
        )


def test_ignore_columns_with_nonstring_values(es):
    error_msg = "list in ignore_columns must only have string values"
    wrong_input_list = {"log": ["a", "b", 3]}
    with pytest.raises(TypeError, match=error_msg):
        DeepFeatureSynthesis(
            target_dataframe_name="log",
            entityset=es,
            ignore_columns=wrong_input_list,
        )


def test_ignore_columns_with_nonstring_keys(es):
    # raw string with escaped brackets so pytest's regex match treats them literally
    error_msg = r"ignore_columns should be dict\[str -> list\]"
    wrong_input_keys = {1: ["a", "b", "c"]}
    with pytest.raises(TypeError, match=error_msg):
        DeepFeatureSynthesis(
            target_dataframe_name="log",
            entityset=es,
            ignore_columns=wrong_input_keys,
        )


def test_makes_dfeatures(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "customers.age")


def test_makes_trans_feat(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[Hour],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "HOUR(datetime)")


def test_handles_diff_dataframe_groupby(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        groupby_trans_primitives=[Diff],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "DIFF(value) by session_id")
    assert feature_with_name(features, "DIFF(value) by product_id")


def test_handles_time_since_previous_dataframe_groupby(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        groupby_trans_primitives=[TimeSincePrevious],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "TIME_SINCE_PREVIOUS(datetime) by session_id")


# M TODO
# def test_handles_cumsum_dataframe_groupby(es):
#     dfs_obj = DeepFeatureSynthesis(target_dataframe_name='sessions',
#                                    entityset=es,
#                                    agg_primitives=[],
#                                    trans_primitives=[CumMean])

#     features = dfs_obj.build_features()
#     assert (feature_with_name(features, u'customers.CUM_MEAN(age by région_id)'))


def test_only_makes_supplied_trans_feat(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[Hour],
    )

    features = dfs_obj.build_features()
    other_trans_features = [
        f
        for f in features
        if (isinstance(f, TransformFeature) and not isinstance(f.primitive, Hour))
        or len(
            [
                g
                for g in f.base_features
                if isinstance(g, TransformFeature) and not isinstance(g.primitive, Hour)
            ],
        )
        > 0
    ]
    assert len(other_trans_features) == 0


def test_makes_dfeatures_of_agg_primitives(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=["max"],
        trans_primitives=[],
    )
    features = dfs_obj.build_features()

    assert feature_with_name(features, "customers.MAX(log.value)")


def test_makes_agg_features_of_trans_primitives(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Mean],
        trans_primitives=[NumCharacters],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "MEAN(log.NUM_CHARACTERS(comments))")


def test_makes_agg_features_with_where(es):
    es.add_interesting_values()

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Count],
        where_primitives=[Count],
        trans_primitives=[],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "COUNT(log WHERE priority_level = 0)")

    # make sure they are made using direct features too
    assert feature_with_name(features, "COUNT(log WHERE products.department = food)")


def test_make_groupby_features(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[],
        groupby_trans_primitives=["cum_sum"],
    )
    features = dfs_obj.build_features()
    assert feature_with_name(features, "CUM_SUM(value) by session_id")


def test_make_indirect_groupby_features(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[],
        groupby_trans_primitives=["cum_sum"],
    )
    features = dfs_obj.build_features()
    assert feature_with_name(features, "CUM_SUM(products.rating) by session_id")


def test_make_groupby_features_with_id(es):
    # Need to convert customer_id to categorical column in order to build desired feature
    es["sessions"].ww.set_types(
        logical_types={"customer_id": "Categorical"},
        semantic_tags={"customer_id": "foreign_key"},
    )
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[],
        groupby_trans_primitives=["cum_count"],
    )
    features = dfs_obj.build_features()

    assert feature_with_name(features, "CUM_COUNT(customer_id) by customer_id")


def test_make_groupby_features_with_diff_id(es):
    # Need to convert cohort to categorical column in order to build desired feature
    es["customers"].ww.set_types(
        logical_types={"cohort": "Categorical"},
        semantic_tags={"cohort": "foreign_key"},
    )
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[],
        groupby_trans_primitives=["cum_count"],
    )
    features = dfs_obj.build_features()

    groupby_with_diff_id = "CUM_COUNT(cohort) by région_id"
    assert feature_with_name(features, groupby_with_diff_id)


def test_make_groupby_features_with_agg(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="cohorts",
        entityset=es,
        agg_primitives=["sum"],
        trans_primitives=[],
        groupby_trans_primitives=["cum_sum"],
    )
    features = dfs_obj.build_features()
    agg_on_groupby_name = "SUM(customers.CUM_SUM(age) by région_id)"
    assert feature_with_name(features, agg_on_groupby_name)


def test_bad_groupby_feature(es):
    msg = re.escape(
        "Unknown groupby transform primitive max. "
        "Call ft.primitives.list_primitives() to get "
        "a list of available primitives",
    )
    with pytest.raises(ValueError, match=msg):
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=["sum"],
            trans_primitives=[],
            groupby_trans_primitives=["Max"],
        )


@pytest.mark.parametrize(
    "rolling_primitive",
    [
        RollingMax,
        RollingMean,
        RollingMin,
        RollingOutlierCount,
        RollingSTD,
    ],
)
@pytest.mark.parametrize(
    "window_length, gap",
    [
        (7, 3),
        ("7d", "3d"),
    ],
)
def test_make_rolling_features(window_length, gap, rolling_primitive, es):
    rolling_primitive_obj = rolling_primitive(
        window_length=window_length,
        gap=gap,
        min_periods=5,
    )
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[rolling_primitive_obj],
    )
    features = dfs_obj.build_features()
    rolling_transform_name = f"{rolling_primitive.name.upper()}(datetime, value_many_nans, window_length={window_length}, gap={gap}, min_periods=5)"
    assert feature_with_name(features, rolling_transform_name)


@pytest.mark.parametrize(
    "window_length, gap",
    [
        (7, 3),
        ("7d", "3d"),
    ],
)
def test_make_rolling_count_off_datetime_feature(window_length, gap, es):
    rolling_count = RollingCount(window_length=window_length, min_periods=gap)
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[rolling_count],
    )
    features = dfs_obj.build_features()
    rolling_transform_name = (
        f"ROLLING_COUNT(datetime, window_length={window_length}, min_periods={gap})"
    )
    assert feature_with_name(features, rolling_transform_name)


def test_abides_by_max_depth_param(es):
    for i in [0, 1, 2, 3]:
        dfs_obj = DeepFeatureSynthesis(
            target_dataframe_name="sessions",
            entityset=es,
            agg_primitives=[Sum],
            trans_primitives=[],
            max_depth=i,
        )

        features = dfs_obj.build_features()
        for f in features:
            assert f.get_depth() <= i


def test_max_depth_single_table(transform_es):
    assert len(transform_es.dataframe_dict) == 1

    def make_dfs_obj(max_depth):
        dfs_obj = DeepFeatureSynthesis(
            target_dataframe_name="first",
            entityset=transform_es,
            trans_primitives=[AddNumeric],
            max_depth=max_depth,
        )
        return dfs_obj

    for i in [-1, 0, 1, 2]:
        if i in [-1, 2]:
            match = (
                "Only one dataframe in entityset, changing max_depth to 1 "
                "since deeper features cannot be created"
            )
            with pytest.warns(UserWarning, match=match):
                dfs_obj = make_dfs_obj(i)
        else:
            dfs_obj = make_dfs_obj(i)

        features = dfs_obj.build_features()
        assert len(features) > 0
        if i != 0:
            # at least one depth 1 feature made
            assert any([f.get_depth() == 1 for f in features])
            # no depth 2 or higher even with max_depth=2
            assert all([f.get_depth() <= 1 for f in features])
        else:
            # no depth 1 or higher features with max_depth=0
            assert all([f.get_depth() == 0 for f in features])


def test_drop_contains(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Sum],
        trans_primitives=[],
        max_depth=1,
        seed_features=[],
        drop_contains=[],
    )
    features = dfs_obj.build_features()
    to_drop = features[2]
    partial_name = to_drop.get_name()[:5]

    dfs_drop = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Sum],
        trans_primitives=[],
        max_depth=1,
        seed_features=[],
        drop_contains=[partial_name],
    )
    features = dfs_drop.build_features()
    assert to_drop.get_name() not in [f.get_name() for f in features]


def test_drop_exact(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Sum],
        trans_primitives=[],
        max_depth=1,
        seed_features=[],
        drop_exact=[],
    )
    features = dfs_obj.build_features()
    to_drop = features[2]
    name = to_drop.get_name()
    dfs_drop = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Sum],
        trans_primitives=[],
        max_depth=1,
        seed_features=[],
        drop_exact=[name],
    )
    features = dfs_drop.build_features()
    assert name not in [f.get_name() for f in features]


def test_seed_features(es):
    seed_feature_sessions = (
        Feature(es["log"].ww["id"], parent_dataframe_name="sessions", primitive=Count)
        > 2
    )
    seed_feature_log = Feature(es["log"].ww["comments"], primitive=NumCharacters)
    session_agg = Feature(
        seed_feature_log,
        parent_dataframe_name="sessions",
        primitive=Mean,
    )
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Mean],
        trans_primitives=[],
        max_depth=2,
        seed_features=[seed_feature_sessions, seed_feature_log],
    )
    features = dfs_obj.build_features()
    assert seed_feature_sessions.get_name() in [f.get_name() for f in features]
    assert session_agg.get_name() in [f.get_name() for f in features]


def test_does_not_make_agg_of_direct_of_target_dataframe(es):
    count_sessions = Feature(
        es["sessions"].ww["id"],
        parent_dataframe_name="customers",
        primitive=Count,
    )
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[Last],
        trans_primitives=[],
        max_depth=2,
        seed_features=[count_sessions],
    )
    features = dfs_obj.build_features()
    # this feature is meaningless because customers.COUNT(sessions) is already defined on
    # the customers dataframe
    assert not feature_with_name(features, "LAST(sessions.customers.COUNT(sessions))")
    assert not feature_with_name(features, "LAST(sessions.customers.age)")


def test_dfs_builds_on_seed_features_more_than_max_depth(es):
    seed_feature_sessions = Feature(
        es["log"].ww["id"],
        parent_dataframe_name="sessions",
        primitive=Count,
    )
    seed_feature_log = Feature(es["log"].ww["datetime"], primitive=Hour)
    session_agg = Feature(
        seed_feature_log,
        parent_dataframe_name="sessions",
        primitive=Last,
    )

    # Depth of this feat is 2 relative to session_agg, the seed feature,
    # which is greater than max_depth so it shouldn't be built
    session_agg_trans = DirectFeature(
        Feature(session_agg, parent_dataframe_name="customers", primitive=Mode),
        "sessions",
    )
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Last, Count],
        trans_primitives=[],
        max_depth=1,
        seed_features=[seed_feature_sessions, seed_feature_log],
    )
    features = dfs_obj.build_features()
    assert seed_feature_sessions.get_name() in [f.get_name() for f in features]
    assert session_agg.get_name() in [f.get_name() for f in features]
    assert session_agg_trans.get_name() not in [f.get_name() for f in features]


def test_dfs_includes_seed_features_greater_than_max_depth(es):
    session_agg = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="sessions",
        primitive=Sum,
    )
    customer_agg = Feature(
        session_agg,
        parent_dataframe_name="customers",
        primitive=Mean,
    )
    assert customer_agg.get_depth() == 2

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[Mean],
        trans_primitives=[],
        max_depth=1,
        seed_features=[customer_agg],
    )
    features = dfs_obj.build_features()
    assert feature_with_name(features=features, name=customer_agg.get_name())


def test_allowed_paths(es):
    kwargs = dict(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[Last],
        trans_primitives=[],
        max_depth=2,
        seed_features=[],
    )
    dfs_unconstrained = DeepFeatureSynthesis(**kwargs)
    features_unconstrained = dfs_unconstrained.build_features()

    unconstrained_names = [f.get_name() for f in features_unconstrained]
    customers_session_feat = Feature(
        es["sessions"].ww["device_type"],
        parent_dataframe_name="customers",
        primitive=Last,
    )
    customers_session_log_feat = Feature(
        es["log"].ww["value"],
        parent_dataframe_name="customers",
        primitive=Last,
    )
    assert customers_session_feat.get_name() in unconstrained_names
    assert customers_session_log_feat.get_name() in unconstrained_names

    dfs_constrained = DeepFeatureSynthesis(
        allowed_paths=[["customers", "sessions"]], **kwargs
    )
    features = dfs_constrained.build_features()
    names = [f.get_name() for f in features]
    assert customers_session_feat.get_name() in names
    assert customers_session_log_feat.get_name() not in names


def test_max_features(es):
    kwargs = dict(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[Sum],
        trans_primitives=[],
        max_depth=2,
        seed_features=[],
    )
    dfs_unconstrained = DeepFeatureSynthesis(**kwargs)
    features_unconstrained = dfs_unconstrained.build_features()
    dfs_unconstrained_with_arg = DeepFeatureSynthesis(max_features=-1, **kwargs)
    feats_unconstrained_with_arg = dfs_unconstrained_with_arg.build_features()
    dfs_constrained = DeepFeatureSynthesis(max_features=1, **kwargs)
    features = dfs_constrained.build_features()
    assert len(features_unconstrained) == len(feats_unconstrained_with_arg)
    assert len(features) == 1


def test_where_primitives(es):
    es.add_interesting_values(dataframe_name="sessions", values={"device_type": [0]})
    kwargs = dict(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[Count, Sum],
        trans_primitives=[Absolute],
        max_depth=3,
    )
    dfs_unconstrained = DeepFeatureSynthesis(**kwargs)
    dfs_constrained = DeepFeatureSynthesis(where_primitives=["sum"], **kwargs)
    features_unconstrained = dfs_unconstrained.build_features()
    features = dfs_constrained.build_features()

    where_feats_unconstrained = [
        f
        for f in features_unconstrained
        if isinstance(f, AggregationFeature) and f.where is not None
    ]
    where_feats = [
        f for f in features if isinstance(f, AggregationFeature) and f.where is not None
    ]

    assert len(where_feats_unconstrained) >= 1

    assert (
        len([f for f in where_feats_unconstrained if isinstance(f.primitive, Sum)]) == 0
    )
    assert (
        len([f for f in where_feats_unconstrained if isinstance(f.primitive, Count)])
        > 0
    )

    assert len([f for f in where_feats if isinstance(f.primitive, Sum)]) > 0
    assert len([f for f in where_feats if isinstance(f.primitive, Count)]) == 0
    assert (
        len(
            [
                d
                for f in where_feats
                for d in f.get_dependencies(deep=True)
                if isinstance(d.primitive, Absolute)
            ],
        )
        > 0
    )


def test_stacking_where_primitives(es):
    es = copy.deepcopy(es)
    es.add_interesting_values(dataframe_name="sessions", values={"device_type": [0]})
    es.add_interesting_values(
        dataframe_name="log",
        values={"product_id": ["coke_zero"]},
    )
    kwargs = dict(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[Count, Last],
        max_depth=3,
    )
    dfs_where_stack_limit_1 = DeepFeatureSynthesis(
        where_primitives=["last", Count], **kwargs
    )
    dfs_where_stack_limit_2 = DeepFeatureSynthesis(
        where_primitives=["last", Count], where_stacking_limit=2, **kwargs
    )
    stack_limit_1_features = dfs_where_stack_limit_1.build_features()
    stack_limit_2_features = dfs_where_stack_limit_2.build_features()

    where_stack_1_feats = [
        f
        for f in stack_limit_1_features
        if isinstance(f, AggregationFeature) and f.where is not None
    ]
    where_stack_2_feats = [
        f
        for f in stack_limit_2_features
        if isinstance(f, AggregationFeature) and f.where is not None
    ]

    assert len(where_stack_1_feats) >= 1
    assert len(where_stack_2_feats) >= 1

    assert len([f for f in where_stack_1_feats if isinstance(f.primitive, Last)]) > 0
    assert len([f for f in where_stack_1_feats if isinstance(f.primitive, Count)]) > 0

    assert len([f for f in where_stack_2_feats if isinstance(f.primitive, Last)]) > 0
    assert len([f for f in where_stack_2_feats if isinstance(f.primitive, Count)]) > 0

    stacked_where_limit_1_feats = []
    stacked_where_limit_2_feats = []
    where_double_where_tuples = [
        (where_stack_1_feats, stacked_where_limit_1_feats),
        (where_stack_2_feats, stacked_where_limit_2_feats),
    ]
    for where_list, double_where_list in where_double_where_tuples:
        for feature in where_list:
            for base_feat in feature.base_features:
                if (
                    isinstance(base_feat, AggregationFeature)
                    and base_feat.where is not None
                ):
                    double_where_list.append(feature)

    assert len(stacked_where_limit_1_feats) == 0
    assert len(stacked_where_limit_2_feats) > 0


def test_where_different_base_feats(es):
    es.add_interesting_values(dataframe_name="sessions", values={"device_type": [0]})

    kwargs = dict(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[Sum, Count],
        where_primitives=[Sum, Count],
        max_depth=3,
    )
    dfs_unconstrained = DeepFeatureSynthesis(**kwargs)
    features = dfs_unconstrained.build_features()
    where_feats = [
        f.unique_name()
        for f in features
        if isinstance(f, AggregationFeature) and f.where is not None
    ]
    not_where_feats = [
        f.unique_name()
        for f in features
        if isinstance(f, AggregationFeature) and f.where is None
    ]
    for name in not_where_feats:
        assert name not in where_feats


def test_dfeats_where(es):
    es.add_interesting_values()

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Count],
        trans_primitives=[],
    )

    features = dfs_obj.build_features()

    # test to make sure we build direct features of agg features with where clause
    assert feature_with_name(features, "customers.COUNT(log WHERE priority_level = 0)")

    assert feature_with_name(
        features,
        "COUNT(log WHERE products.department = electronics)",
    )


def test_commutative(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[Sum],
        trans_primitives=[AddNumeric],
        max_depth=3,
    )
    feats = dfs_obj.build_features()

    add_feats = [f for f in feats if isinstance(f.primitive, AddNumeric)]

    # Check that there are no two AddNumeric features with the same base
    # features.
    unordered_args = set()
    for f in add_feats:
        arg1, arg2 = f.base_features
        args_set = frozenset({arg1.unique_name(), arg2.unique_name()})
        unordered_args.add(args_set)

    assert len(add_feats) == len(unordered_args)


def test_transform_consistency(transform_es):
    # Generate features
    transform_es["first"].ww.set_types(
        logical_types={"b": "BooleanNullable", "b1": "BooleanNullable"},
    )
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="first",
        entityset=transform_es,
        trans_primitives=["and", "add_numeric", "or"],
        max_depth=1,
    )
    feature_defs = dfs_obj.build_features()

    # Check that the identity features exist and that commutative
    # primitives are built with a single, consistent argument order
    assert feature_with_name(feature_defs, "a")
    assert feature_with_name(feature_defs, "b")
    assert feature_with_name(feature_defs, "b1")
    assert feature_with_name(feature_defs, "b12")
    assert feature_with_name(feature_defs, "P")

    assert feature_with_name(feature_defs, "AND(b, b1)")
    assert not feature_with_name(
        feature_defs,
        "AND(b1, b)",
    )  # make sure it doesn't exist the other way
    assert feature_with_name(feature_defs, "a + P")
    assert feature_with_name(feature_defs, "b12 + P")
    assert feature_with_name(feature_defs, "a + b12")
    assert feature_with_name(feature_defs, "OR(b, b1)")


def test_transform_no_stack_agg(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[NMostCommon],
        trans_primitives=[NotEqual],
        max_depth=3,
    )
    feature_defs = dfs_obj.build_features()

    assert not feature_with_name(
        feature_defs,
        "id != N_MOST_COMMON(sessions.device_type)",
    )


def test_initialized_trans_prim(es):
    prim = IsIn(list_of_outputs=["coke zero"])
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[prim],
    )

    features = dfs_obj.build_features()

    assert feature_with_name(features, "product_id.isin(['coke zero'])")


def test_initialized_agg_prim(es):
    three_most = NMostCommon(n=3)
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[three_most],
        trans_primitives=[],
    )
    features = dfs_obj.build_features()

    assert feature_with_name(features, "N_MOST_COMMON(log.subregioncode)")


def test_return_types(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[Count, NMostCommon],
        trans_primitives=[Absolute, Hour, IsIn],
    )

    discrete = ColumnSchema(semantic_tags={"category"})
    numeric = ColumnSchema(semantic_tags={"numeric"})
    datetime = ColumnSchema(logical_type=Datetime)

    f1 = dfs_obj.build_features(return_types=None)
    f2 = dfs_obj.build_features(return_types=[discrete])
    f3 = dfs_obj.build_features(return_types="all")
    f4 = dfs_obj.build_features(return_types=[datetime])

    f1_types = [f.column_schema for f in f1]
    f2_types = [f.column_schema for f in f2]
    f3_types = [f.column_schema for f in f3]
    f4_types = [f.column_schema for f in f4]

    assert any([is_valid_input(schema, discrete) for schema in f1_types])
    assert any([is_valid_input(schema, numeric) for schema in f1_types])
    assert not any([is_valid_input(schema, datetime) for schema in f1_types])

    assert any([is_valid_input(schema, discrete) for schema in f2_types])
    assert not any([is_valid_input(schema, numeric) for schema in f2_types])
    assert not any([is_valid_input(schema, datetime) for schema in f2_types])

    assert any([is_valid_input(schema, discrete) for schema in f3_types])
    assert any([is_valid_input(schema, numeric) for schema in f3_types])
    assert any([is_valid_input(schema, datetime) for schema in f3_types])

    assert not any([is_valid_input(schema, discrete) for schema in f4_types])
    assert not any([is_valid_input(schema, numeric) for schema in f4_types])
    assert any([is_valid_input(schema, datetime) for schema in f4_types])


def test_checks_primitives_correct_type(es):
    error_text = (
        "Primitive <class \\'featuretools\\.primitives\\.standard\\."
        "transform\\.datetime\\.hour\\.Hour\\'> in "
        "agg_primitives is not an aggregation primitive"
    )
    with pytest.raises(ValueError, match=error_text):
        DeepFeatureSynthesis(
            target_dataframe_name="sessions",
            entityset=es,
            agg_primitives=[Hour],
            trans_primitives=[],
        )

    error_text = (
        "Primitive <class \\'featuretools\\.primitives\\.standard\\."
        "aggregation\\.sum_primitive\\.Sum\\'> in trans_primitives "
        "is not a transform primitive"
    )
    with pytest.raises(ValueError, match=error_text):
        DeepFeatureSynthesis(
            target_dataframe_name="sessions",
            entityset=es,
            agg_primitives=[],
            trans_primitives=[Sum],
        )


def test_makes_agg_features_along_multiple_paths(diamond_es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="regions",
        entityset=diamond_es,
        agg_primitives=["mean"],
        trans_primitives=[],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "MEAN(customers.transactions.amount)")
    assert feature_with_name(features, "MEAN(stores.transactions.amount)")


def test_makes_direct_features_through_multiple_relationships(games_es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="games",
        entityset=games_es,
        agg_primitives=["mean"],
        trans_primitives=[],
    )

    features = dfs_obj.build_features()

    teams = ["home", "away"]
    for forward in teams:
        for backward in teams:
            for col in teams:
                f = (
                    f"teams[{forward}_team_id]."
                    f"MEAN(games[{backward}_team_id].{col}_team_score)"
                )
                assert feature_with_name(features, f)


def test_stacks_multioutput_features(es):
    class TestTime(TransformPrimitive):
        name = "test_time"
        input_types = [ColumnSchema(logical_type=Datetime)]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        number_output_features = 6

        def get_function(self):
            def test_f(x):
                times = pd.Series(x)
                units = ["year", "month", "day", "hour", "minute", "second"]
                # Use a distinct lambda argument so it does not shadow the outer x
                return [times.apply(lambda t: getattr(t, unit)) for unit in units]

            return test_f

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[NumUnique, NMostCommon(n=3)],
        trans_primitives=[TestTime, Diff],
        max_depth=4,
    )
    feat = dfs_obj.build_features()

    for i in range(3):
        f = f"NUM_UNIQUE(sessions.N_MOST_COMMON(log.countrycode)[{i}])"
        assert feature_with_name(feat, f)


def test_seed_multi_output_feature_stacking(es):
    threecommon = NMostCommon(3)
    tc = Feature(
        es["log"].ww["product_id"],
        parent_dataframe_name="sessions",
        primitive=threecommon,
    )

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        seed_features=[tc],
        agg_primitives=[NumUnique],
        trans_primitives=[],
        max_depth=4,
    )
    feat = dfs_obj.build_features()

    for i in range(3):
        f = f"NUM_UNIQUE(sessions.N_MOST_COMMON(log.product_id)[{i}])"
        assert feature_with_name(feat, f)


def test_makes_direct_features_along_multiple_paths(diamond_es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="transactions",
        entityset=diamond_es,
        max_depth=3,
        agg_primitives=[],
        trans_primitives=[],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "customers.regions.name")
    assert feature_with_name(features, "stores.regions.name")


def test_does_not_make_trans_of_single_direct_feature(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[],
        trans_primitives=["weekday"],
        max_depth=2,
    )

    features = dfs_obj.build_features()

    assert not feature_with_name(features, "WEEKDAY(customers.signup_date)")
    assert feature_with_name(features, "customers.WEEKDAY(signup_date)")


def test_makes_trans_of_multiple_direct_features(diamond_es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="transactions",
        entityset=diamond_es,
        agg_primitives=["mean"],
        trans_primitives=[Equal],
        max_depth=4,
    )

    features = dfs_obj.build_features()

    # Make trans of direct and non-direct
    assert feature_with_name(features, "amount = stores.MEAN(transactions.amount)")

    # Make trans of direct features on different dataframes
    assert feature_with_name(
        features,
        "customers.MEAN(transactions.amount) = stores.square_ft",
    )

    # Make trans of direct features on same dataframe with different paths.
    assert feature_with_name(features, "customers.regions.name = stores.regions.name")

    # Don't make trans of direct features with same path.
    assert not feature_with_name(
        features,
        "stores.square_ft = stores.MEAN(transactions.amount)",
    )
    assert not feature_with_name(
        features,
        "stores.MEAN(transactions.amount) = stores.square_ft",
    )

    # The name below reads ambiguously, but it is a direct feature of a
    # transform feature, i.e. stores.(MEAN(transactions.amount) = square_ft).
    assert feature_with_name(features, "stores.MEAN(transactions.amount) = square_ft")


def test_makes_direct_of_agg_of_trans_on_target(es):
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=["mean"],
        trans_primitives=[Absolute],
        max_depth=3,
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "sessions.MEAN(log.ABSOLUTE(value))")


def test_primitive_options_errors(es):
    wrong_key_options = {"mean": {"ignore_dataframe": ["sessions"]}}
    wrong_type_list = {"mean": {"ignore_dataframes": "sessions"}}
    wrong_type_dict = {"mean": {"ignore_columns": {"sessions": "product_id"}}}
    conflicting_primitive_options = {
        ("count", "mean"): {"ignore_dataframes": ["sessions"]},
        "mean": {"include_dataframes": ["sessions"]},
    }
    invalid_dataframe = {"mean": {"include_dataframes": ["invalid_dataframe"]}}
    invalid_column_dataframe = {
        "mean": {"include_columns": {"invalid_dataframe": ["product_id"]}},
    }
    invalid_column = {"mean": {"include_columns": {"sessions": ["invalid_column"]}}}
    key_error_text = "Unrecognized primitive option 'ignore_dataframe' for mean"
    list_error_text = "Incorrect type formatting for 'ignore_dataframes' for mean"
    dict_error_text = "Incorrect type formatting for 'ignore_columns' for mean"
    conflicting_error_text = "Multiple options found for primitive mean"
    invalid_dataframe_warning = "Dataframe 'invalid_dataframe' not in entityset"
    invalid_column_warning = "Column 'invalid_column' not in dataframe 'sessions'"
    with pytest.raises(KeyError, match=key_error_text):
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=["mean"],
            trans_primitives=[],
            primitive_options=wrong_key_options,
        )
    with pytest.raises(TypeError, match=list_error_text):
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=["mean"],
            trans_primitives=[],
            primitive_options=wrong_type_list,
        )
    with pytest.raises(TypeError, match=dict_error_text):
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=["mean"],
            trans_primitives=[],
            primitive_options=wrong_type_dict,
        )
    with pytest.raises(KeyError, match=conflicting_error_text):
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=["mean"],
            trans_primitives=[],
            primitive_options=conflicting_primitive_options,
        )
    with pytest.warns(UserWarning, match=invalid_dataframe_warning) as record:
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=["mean"],
            trans_primitives=[],
            primitive_options=invalid_dataframe,
        )
    assert len(record) == 1
    with pytest.warns(UserWarning, match=invalid_dataframe_warning) as record:
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=["mean"],
            trans_primitives=[],
            primitive_options=invalid_column_dataframe,
        )
    assert len(record) == 1
    with pytest.warns(UserWarning, match=invalid_column_warning) as record:
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=["mean"],
            trans_primitives=[],
            primitive_options=invalid_column,
        )
    assert len(record) == 1


def test_primitive_options(es):
    options = {
        "sum": {"include_columns": {"customers": ["age"]}},
        "mean": {"include_dataframes": ["customers"]},
        "mode": {"ignore_dataframes": ["sessions"]},
        "num_unique": {"ignore_columns": {"customers": ["engagement_level"]}},
    }
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="cohorts",
        entityset=es,
        primitive_options=options,
    )
    features = dfs_obj.build_features()

    for f in features:
        deps = f.get_dependencies(deep=True)
        df_names = [d.dataframe_name for d in deps]
        columns = [d for d in deps if isinstance(d, IdentityFeature)]
        if isinstance(f.primitive, Sum):
            for identity_base in columns:
                if identity_base.dataframe_name == "customers":
                    assert identity_base.get_name() == "age"
        if isinstance(f.primitive, Mean):
            assert all([df_name in ["customers"] for df_name in df_names])
        if isinstance(f.primitive, Mode):
            assert "sessions" not in df_names
        if isinstance(f.primitive, NumUnique):
            for identity_base in columns:
                assert not (
                    identity_base.dataframe_name == "customers"
                    and identity_base.get_name() == "engagement_level"
                )

    options = {
        "month": {"ignore_columns": {"customers": ["birthday"]}},
        "day": {"include_columns": {"customers": ["signup_date", "upgrade_date"]}},
        "num_characters": {"ignore_dataframes": ["customers"]},
        "year": {"include_dataframes": ["customers"]},
    }
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        agg_primitives=[],
        ignore_dataframes=["cohort"],
        primitive_options=options,
    )
    features = dfs_obj.build_features()
    assert not any(isinstance(f.primitive, NumCharacters) for f in features)
    for f in features:
        deps = f.get_dependencies(deep=True)
        df_names = [d.dataframe_name for d in deps]
        columns = [d for d in deps if isinstance(d, IdentityFeature)]
        if isinstance(f.primitive, Month):
            for identity_base in columns:
                assert not (
                    identity_base.dataframe_name == "customers"
                    and identity_base.get_name() == "birthday"
                )
        if isinstance(f.primitive, Day):
            for identity_base in columns:
                if identity_base.dataframe_name == "customers":
                    assert (
                        identity_base.get_name() == "signup_date"
                        or identity_base.get_name() == "upgrade_date"
                    )
        if isinstance(f.primitive, Year):
            assert all([df_name in ["customers"] for df_name in df_names])


def test_primitive_options_with_globals(es):
    # non-overlapping ignore_dataframes
    options = {"mode": {"ignore_dataframes": ["sessions"]}}
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="cohorts",
        entityset=es,
        ignore_dataframes=["régions"],
        primitive_options=options,
    )
    features = dfs_obj.build_features()
    for f in features:
        deps = f.get_dependencies(deep=True)
        df_names = [d.dataframe_name for d in deps]
        assert "régions" not in df_names
        if isinstance(f.primitive, Mode):
            assert "sessions" not in df_names

    # non-overlapping ignore_columns
    options = {"num_unique": {"ignore_columns": {"customers": ["engagement_level"]}}}
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        ignore_columns={"customers": ["région_id"]},
        primitive_options=options,
    )
    features = dfs_obj.build_features()
    for f in features:
        deps = f.get_dependencies(deep=True)
        columns = [d for d in deps if isinstance(d, IdentityFeature)]
        for identity_base in columns:
            assert not (
                identity_base.dataframe_name == "customers"
                and identity_base.get_name() == "région_id"
            )
        if isinstance(f.primitive, NumUnique):
            for identity_base in columns:
                assert not (
                    identity_base.dataframe_name == "customers"
                    and identity_base.get_name() == "engagement_level"
                )

    # Overlapping globals/options with ignore_dataframes
    options = {
        "mode": {
            "include_dataframes": ["sessions", "customers"],
            "ignore_columns": {"customers": ["région_id"]},
        },
        "num_unique": {
            "include_dataframes": ["sessions", "customers"],
            "include_columns": {"sessions": ["device_type"], "customers": ["age"]},
        },
        "month": {"ignore_columns": {"cohorts": ["cohort_end"]}},
    }
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="cohorts",
        entityset=es,
        ignore_dataframes=["sessions"],
        ignore_columns={"customers": ["age"]},
        primitive_options=options,
    )
    features = dfs_obj.build_features()
    for f in features:
        assert f.primitive.name != "month"
        # ignoring 'cohort_end' on 'cohorts' means no Month features are created
        assert not isinstance(f.primitive, Month)

        deps = f.get_dependencies(deep=True)
        df_names = [d.dataframe_name for d in deps]
        columns = [d for d in deps if isinstance(d, IdentityFeature)]
        if isinstance(f.primitive, Mode):
            assert all(df_name in ["sessions", "customers"] for df_name in df_names)
            for identity_base in columns:
                assert not (
                    identity_base.dataframe_name == "customers"
                    and (
                        identity_base.get_name() == "age"
                        or identity_base.get_name() == "région_id"
                    )
                )
        elif isinstance(f.primitive, NumUnique):
            assert all(df_name in ["sessions", "customers"] for df_name in df_names)
            for identity_base in columns:
                if identity_base.dataframe_name == "sessions":
                    assert identity_base.get_name() == "device_type"
        # All other primitives ignore 'sessions' and 'age'
        else:
            assert "sessions" not in df_names
            for identity_base in columns:
                assert not (
                    identity_base.dataframe_name == "customers"
                    and identity_base.get_name() == "age"
                )


def test_primitive_options_groupbys(es):
    options = {
        "cum_count": {"include_groupby_dataframes": ["log", "customers"]},
        "cum_sum": {"ignore_groupby_dataframes": ["sessions"]},
        "cum_mean": {
            "ignore_groupby_columns": {
                "customers": ["région_id"],
                "log": ["session_id"],
            },
        },
        "cum_min": {
            "include_groupby_columns": {"sessions": ["customer_id", "device_type"]},
        },
    }

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[],
        max_depth=3,
        groupby_trans_primitives=["cum_sum", "cum_count", "cum_min", "cum_mean"],
        primitive_options=options,
    )
    features = dfs_obj.build_features()
    for f in features:
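        # Only groupby transform features have a groupby; skip the rest so that
        # values left over from a previous iteration are never checked.
        if not isinstance(f, GroupByTransformFeature):
            continue
        deps = f.groupby.get_dependencies(deep=True)
        df_names = [d.dataframe_name for d in deps] + [f.groupby.dataframe_name]
        columns = [d for d in deps if isinstance(d, IdentityFeature)]
        columns += [f.groupby] if isinstance(f.groupby, IdentityFeature) else []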
        if isinstance(f.primitive, CumMean):
            for identity_groupby in columns:
                assert not (
                    identity_groupby.dataframe_name == "customers"
                    and identity_groupby.get_name() == "région_id"
                )
                assert not (
                    identity_groupby.dataframe_name == "log"
                    and identity_groupby.get_name() == "session_id"
                )
        if isinstance(f.primitive, CumCount):
            assert all([name in ["log", "customers"] for name in df_names])
        if isinstance(f.primitive, CumSum):
            assert "sessions" not in df_names
        if isinstance(f.primitive, CumMin):
            for identity_groupby in columns:
                if identity_groupby.dataframe_name == "sessions":
                    assert (
                        identity_groupby.get_name() == "customer_id"
                        or identity_groupby.get_name() == "device_type"
                    )


def test_primitive_options_multiple_inputs(es):
    too_many_options = {
        "mode": [{"include_dataframes": ["logs"]}, {"ignore_dataframes": ["sessions"]}],
    }
    error_msg = "Number of options does not match number of inputs for primitive mode"
    with pytest.raises(AssertionError, match=error_msg):
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=["mode"],
            trans_primitives=[],
            primitive_options=too_many_options,
        )

    unknown_primitive = Trend()
    unknown_primitive.name = "unknown_primitive"
    unknown_primitive_option = {
        "unknown_primitive": [
            {"include_dataframes": ["logs"]},
            {"ignore_dataframes": ["sessions"]},
        ],
    }
    error_msg = "Unknown primitive with name 'unknown_primitive'"
    with pytest.raises(ValueError, match=error_msg):
        DeepFeatureSynthesis(
            target_dataframe_name="customers",
            entityset=es,
            agg_primitives=[unknown_primitive],
            trans_primitives=[],
            primitive_options=unknown_primitive_option,
        )

    options1 = {
        "trend": [
            {"include_dataframes": ["log"], "ignore_columns": {"log": ["value"]}},
            {"include_dataframes": ["log"], "include_columns": {"log": ["datetime"]}},
        ],
    }
    dfs_obj1 = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=["trend"],
        trans_primitives=[],
        primitive_options=options1,
    )
    features1 = dfs_obj1.build_features()
    for f in features1:
        deps = f.get_dependencies()
        df_names = [d.dataframe_name for d in deps]
        columns = [d.get_name() for d in deps]
        if f.primitive.name == "trend":
            assert all([df_name in ["log"] for df_name in df_names])
            assert "datetime" in columns
            if len(columns) == 2:
                assert "value" != columns[0]

    options2 = {
        Trend: [
            {"include_dataframes": ["log"], "ignore_columns": {"log": ["value"]}},
            {"include_dataframes": ["log"], "include_columns": {"log": ["datetime"]}},
        ],
    }
    dfs_obj2 = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=["trend"],
        trans_primitives=[],
        primitive_options=options2,
    )
    features2 = dfs_obj2.build_features()

    assert set(features2) == set(features1)


def test_primitive_options_class_names(es):
    options1 = {"mean": {"include_dataframes": ["customers"]}}

    options2 = {Mean: {"include_dataframes": ["customers"]}}

    bad_options = {
        "mean": {"include_dataframes": ["customers"]},
        Mean: {"ignore_dataframes": ["customers"]},
    }
    conflicting_error_text = "Multiple options found for primitive mean"

    primitives = [["mean"], [Mean]]
    options = [options1, options2]

    features = []
    for primitive in primitives:
        with pytest.raises(KeyError, match=conflicting_error_text):
            DeepFeatureSynthesis(
                target_dataframe_name="cohorts",
                entityset=es,
                agg_primitives=primitive,
                trans_primitives=[],
                primitive_options=bad_options,
            )
        for option in options:
            dfs_obj = DeepFeatureSynthesis(
                target_dataframe_name="cohorts",
                entityset=es,
                agg_primitives=primitive,
                trans_primitives=[],
                primitive_options=option,
            )
            features.append(set(dfs_obj.build_features()))

    for f in features[0]:
        deps = f.get_dependencies(deep=True)
        df_names = [d.dataframe_name for d in deps]
        if isinstance(f.primitive, Mean):
            assert all(df_name == "customers" for df_name in df_names)

    assert features[0] == features[1] == features[2] == features[3]


def test_primitive_options_instantiated_primitive(es):
    warning_msg = (
        "Options present for primitive instance and generic "
        "primitive class \\(mean\\), primitive instance will not use generic "
        "options"
    )

    skipna_mean = Mean(skipna=False)
    options = {
        skipna_mean: {"include_dataframes": ["stores"]},
        "mean": {"ignore_dataframes": ["stores"]},
    }
    with pytest.warns(UserWarning, match=warning_msg):
        dfs_obj = DeepFeatureSynthesis(
            target_dataframe_name="régions",
            entityset=es,
            agg_primitives=["mean", skipna_mean],
            trans_primitives=[],
            primitive_options=options,
        )

    features = dfs_obj.build_features()
    for f in features:
        deps = f.get_dependencies(deep=True)
        df_names = [d.dataframe_name for d in deps]
        if f.primitive == skipna_mean:
            assert all(df_name == "stores" for df_name in df_names)
        elif isinstance(f.primitive, Mean):
            assert "stores" not in df_names


def test_primitive_options_commutative(es):
    class AddThree(TransformPrimitive):
        name = "add_three"
        input_types = [
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(semantic_tags={"numeric"}),
            ColumnSchema(semantic_tags={"numeric"}),
        ]
        return_type = ColumnSchema(semantic_tags={"numeric"})
        commutative = True

        def generate_name(self, base_feature_names):
            return "%s + %s + %s" % (
                base_feature_names[0],
                base_feature_names[1],
                base_feature_names[2],
            )

    options = {
        "add_numeric": [
            {"include_columns": {"log": ["value_2"]}},
            {"include_columns": {"log": ["value"]}},
        ],
        AddThree: [
            {"include_columns": {"log": ["value_2"]}},
            {"include_columns": {"log": ["value_many_nans"]}},
            {"include_columns": {"log": ["value"]}},
        ],
    }
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[AddNumeric, AddThree],
        primitive_options=options,
        max_depth=1,
    )
    features = dfs_obj.build_features()
    add_numeric = [f for f in features if isinstance(f.primitive, AddNumeric)]
    assert len(add_numeric) == 1
    deps = add_numeric[0].get_dependencies(deep=True)
    assert deps[0].get_name() == "value_2" and deps[1].get_name() == "value"

    add_three = [f for f in features if isinstance(f.primitive, AddThree)]
    assert len(add_three) == 1
    deps = add_three[0].get_dependencies(deep=True)
    assert (
        deps[0].get_name() == "value_2"
        and deps[1].get_name() == "value_many_nans"
        and deps[2].get_name() == "value"
    )


def test_primitive_options_include_over_exclude(es):
    options = {
        "mean": {"ignore_dataframes": ["stores"], "include_dataframes": ["stores"]},
    }
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="régions",
        entityset=es,
        agg_primitives=["mean"],
        trans_primitives=[],
        primitive_options=options,
    )

    features = dfs_obj.build_features()
    at_least_one_mean = False
    for f in features:
        deps = f.get_dependencies(deep=True)
        dataframes = [d.dataframe_name for d in deps]
        if isinstance(f.primitive, Mean):
            at_least_one_mean = True
            assert "stores" in dataframes
    assert at_least_one_mean


def test_primitive_ordering():
    # Test that the order of the input primitives impacts neither
    # which features are created nor their order
    es = make_ecommerce_entityset()

    trans_prims = [AddNumeric, Absolute, "divide_numeric", NotEqual, "is_null"]
    groupby_trans_prim = ["cum_mean", CumMin, CumSum]
    agg_prims = [NMostCommon(n=3), Sum, Mean, Mean(skipna=False), "min", "max"]
    where_prims = ["count", Sum]

    seed_num_chars = Feature(
        es["customers"].ww["favorite_quote"],
        primitive=NumCharacters,
    )
    seed_is_null = Feature(es["customers"].ww["age"], primitive=IsNull)
    seed_features = [seed_num_chars, seed_is_null]

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        trans_primitives=trans_prims,
        groupby_trans_primitives=groupby_trans_prim,
        agg_primitives=agg_prims,
        where_primitives=where_prims,
        seed_features=seed_features,
        max_features=-1,
        max_depth=2,
    )
    features1 = dfs_obj.build_features()

    trans_prims.reverse()
    groupby_trans_prim.reverse()
    agg_prims.reverse()
    where_prims.reverse()
    seed_features.reverse()

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="customers",
        entityset=es,
        trans_primitives=trans_prims,
        groupby_trans_primitives=groupby_trans_prim,
        agg_primitives=agg_prims,
        where_primitives=where_prims,
        seed_features=seed_features,
        max_features=-1,
        max_depth=2,
    )
    features2 = dfs_obj.build_features()

    assert len(features1) == len(features2)

    for i in range(len(features2)):
        assert features1[i].unique_name() == features2[i].unique_name()


def test_no_transform_stacking():
    df1 = pd.DataFrame({"id": [0, 1, 2, 3], "A": [0, 1, 2, 3]})
    df2 = pd.DataFrame(
        {"index": [0, 1, 2, 3], "first_id": [0, 1, 1, 3], "B": [99, 88, 77, 66]},
    )

    dataframes = {"first": (df1, "id"), "second": (df2, "index")}
    relationships = [("first", "id", "second", "first_id")]
    es = EntitySet("data", dataframes, relationships)

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="second",
        entityset=es,
        trans_primitives=["negate", "add_numeric"],
        agg_primitives=["sum"],
        max_depth=4,
    )
    feature_defs = dfs_obj.build_features()

    expected = [
        "first_id",
        "B",
        "-(B)",
        "first.A",
        "first.SUM(second.B)",
        "first.-(A)",
        "B + first.A",
        "first.SUM(second.-(B))",
        "first.A + SUM(second.B)",
        "first.-(SUM(second.B))",
        "B + first.SUM(second.B)",
        "first.A + SUM(second.-(B))",
        "first.SUM(second.-(B)) + SUM(second.B)",
        "first.-(SUM(second.-(B)))",
        "B + first.SUM(second.-(B))",
    ]

    assert len(feature_defs) == len(expected)

    for feature_name in expected:
        assert feature_with_name(feature_defs, feature_name)


def test_builds_seed_features_on_foreign_key_col(es):
    seed_feature_sessions = Feature(es["sessions"].ww["customer_id"], primitive=Negate)

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[],
        max_depth=2,
        seed_features=[seed_feature_sessions],
    )

    features = dfs_obj.build_features()
    assert feature_with_name(features, "-(customer_id)")


def test_does_not_build_features_on_last_time_index_col(es):
    es.add_last_time_indexes()

    dfs_obj = DeepFeatureSynthesis(target_dataframe_name="log", entityset=es)

    features = dfs_obj.build_features()

    for feature in features:
        assert LTI_COLUMN_NAME not in feature.get_name()


def test_builds_features_using_all_input_types(es):
    new_log_df = es["log"]
    new_log_df.ww["purchased_nullable"] = es["log"]["purchased"]
    new_log_df.ww.set_types(logical_types={"purchased_nullable": "boolean_nullable"})
    es.replace_dataframe("log", new_log_df)

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        trans_primitives=[Not],
        max_depth=1,
    )
    trans_features = dfs_obj.build_features()
    assert feature_with_name(trans_features, "NOT(purchased)")
    assert feature_with_name(trans_features, "NOT(purchased_nullable)")

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        groupby_trans_primitives=[Not],
        max_depth=1,
    )
    groupby_trans_features = dfs_obj.build_features()
    assert feature_with_name(groupby_trans_features, "NOT(purchased) by session_id")
    assert feature_with_name(
        groupby_trans_features,
        "NOT(purchased_nullable) by session_id",
    )

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="sessions",
        entityset=es,
        trans_primitives=[],
        agg_primitives=[NumTrue],
    )
    agg_features = dfs_obj.build_features()
    assert feature_with_name(agg_features, "NUM_TRUE(log.purchased)")
    assert feature_with_name(agg_features, "NUM_TRUE(log.purchased_nullable)")


def test_make_groupby_features_with_depth_none(es):
    # max_depth=-1 is converted to None internally, so this test
    # validates code paths that run with a None max_depth
    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[],
        trans_primitives=[],
        groupby_trans_primitives=["cum_sum"],
        max_depth=-1,
    )
    features = dfs_obj.build_features()
    assert feature_with_name(features, "CUM_SUM(value) by session_id")


def test_check_stacking_when_building_transform_features(es):
    class NewMean(Mean):
        name = "NEW_MEAN"
        base_of_exclude = [Absolute]

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[NewMean, "mean"],
        trans_primitives=["absolute"],
        max_depth=-1,
    )
    features = dfs_obj.build_features()
    assert number_of_features_with_name_like(features, "ABSOLUTE(MEAN") > 0
    assert number_of_features_with_name_like(features, "ABSOLUTE(NEW_MEAN") == 0


def test_check_stacking_when_building_groupby_features(es):
    class NewMean(Mean):
        name = "NEW_MEAN"
        base_of_exclude = [CumSum]

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=[NewMean, "mean"],
        groupby_trans_primitives=["cum_sum"],
        max_depth=5,
    )
    features = dfs_obj.build_features()
    assert number_of_features_with_name_like(features, "CUM_SUM(MEAN") > 0
    assert number_of_features_with_name_like(features, "CUM_SUM(NEW_MEAN") == 0


def test_check_stacking_when_building_agg_features(es):
    class NewAbsolute(Absolute):
        name = "NEW_ABSOLUTE"
        base_of_exclude = [Mean]

    dfs_obj = DeepFeatureSynthesis(
        target_dataframe_name="log",
        entityset=es,
        agg_primitives=["mean"],
        trans_primitives=[NewAbsolute, "absolute"],
        max_depth=5,
    )
    features = dfs_obj.build_features()
    assert number_of_features_with_name_like(features, "MEAN(log.ABSOLUTE") > 0
    assert number_of_features_with_name_like(features, "MEAN(log.NEW_ABSOLUTE") == 0


================================================
FILE: featuretools/tests/synthesis/test_dfs_method.py
================================================
import warnings
from unittest.mock import patch

import composeml as cp
import numpy as np
import pandas as pd
import pytest
from packaging.version import parse
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import NaturalLanguage

from featuretools.computational_backends.calculate_feature_matrix import (
    FEATURE_CALCULATION_PERCENTAGE,
)
from featuretools.entityset import EntitySet, Timedelta
from featuretools.exceptions import UnusedPrimitiveWarning
from featuretools.primitives import GreaterThanScalar, Max, Mean, Min, Sum
from featuretools.primitives.base import AggregationPrimitive, TransformPrimitive
from featuretools.synthesis import dfs
from featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis


@pytest.fixture
def datetime_es():
    cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5],
            "card_id": [1, 1, 5, 1, 5],
            "transaction_time": pd.to_datetime(
                [
                    "2011-2-28 04:00",
                    "2012-2-28 05:00",
                    "2012-2-29 06:00",
                    "2012-3-1 08:00",
                    "2014-4-1 10:00",
                ],
            ),
            "fraud": [True, False, False, False, True],
        },
    )

    datetime_es = EntitySet(id="fraud_data")
    datetime_es = datetime_es.add_dataframe(
        dataframe_name="transactions",
        dataframe=transactions_df,
        index="id",
        time_index="transaction_time",
    )

    datetime_es = datetime_es.add_dataframe(
        dataframe_name="cards",
        dataframe=cards_df,
        index="id",
    )

    datetime_es = datetime_es.add_relationship("cards", "id", "transactions", "card_id")
    datetime_es.add_last_time_indexes()
    return datetime_es


def test_dfs_empty_features():
    error_text = "No features can be generated from the specified primitives. Please make sure the primitives you are using are compatible with the variable types in your data."
    teams = pd.DataFrame({"id": range(3), "name": ["Breakers", "Spirit", "Thorns"]})
    games = pd.DataFrame(
        {
            "id": range(5),
            "home_team_id": [2, 2, 1, 0, 1],
            "away_team_id": [1, 0, 2, 1, 0],
            "home_team_score": [3, 0, 1, 0, 4],
            "away_team_score": [2, 1, 2, 0, 0],
        },
    )
    dataframes = {
        "teams": (teams, "id", None, {"name": "natural_language"}),
        "games": (games, "id"),
    }
    relationships = [("teams", "id", "games", "home_team_id")]
    with patch.object(DeepFeatureSynthesis, "build_features", return_value=[]):
        features = dfs(
            dataframes,
            relationships,
            target_dataframe_name="teams",
            features_only=True,
        )
        assert features == []
    with (
        pytest.raises(AssertionError, match=error_text),
        patch.object(
            DeepFeatureSynthesis,
            "build_features",
            return_value=[],
        ),
    ):
        dfs(
            dataframes,
            relationships,
            target_dataframe_name="teams",
            features_only=False,
        )


def test_passing_strings_to_logical_types_dfs():
    teams = pd.DataFrame({"id": range(3), "name": ["Breakers", "Spirit", "Thorns"]})
    games = pd.DataFrame(
        {
            "id": range(5),
            "home_team_id": [2, 2, 1, 0, 1],
            "away_team_id": [1, 0, 2, 1, 0],
            "home_team_score": [3, 0, 1, 0, 4],
            "away_team_score": [2, 1, 2, 0, 0],
        },
    )
    dataframes = {
        "teams": (teams, "id", None, {"name": "natural_language"}),
        "games": (games, "id"),
    }
    relationships = [("teams", "id", "games", "home_team_id")]

    features = dfs(
        dataframes,
        relationships,
        target_dataframe_name="teams",
        features_only=True,
    )

    name_logical_type = features[0].dataframe["name"].ww.logical_type
    assert isinstance(name_logical_type, NaturalLanguage)


def test_accepts_cutoff_time_df(dataframes, relationships):
    cutoff_times_df = pd.DataFrame({"instance_id": [1, 2, 3], "time": [10, 12, 15]})
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        cutoff_time=cutoff_times_df,
    )
    assert len(feature_matrix.index) == 3
    assert len(feature_matrix.columns) == len(features)


def test_accepts_cutoff_time_compose(dataframes, relationships):
    def fraud_occurred(df):
        return df["fraud"].any()

    kwargs = {
        "time_index": "transaction_time",
        "labeling_function": fraud_occurred,
        "window_size": 1,
    }
    if parse(cp.__version__) >= parse("0.10.0"):
        kwargs["target_dataframe_index"] = "card_id"
    else:
        kwargs["target_dataframe_name"] = "card_id"  # pragma: no cover

    lm = cp.LabelMaker(**kwargs)

    transactions_df = dataframes["transactions"][0]

    labels = lm.search(transactions_df, num_examples_per_instance=-1)

    labels["time"] = pd.to_numeric(labels["time"])
    labels.rename({"card_id": "id"}, axis=1, inplace=True)

    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="cards",
        cutoff_time=labels,
    )
    assert len(feature_matrix.index) == 6
    assert len(feature_matrix.columns) == len(features) + 1


def test_accepts_single_cutoff_time(dataframes, relationships):
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        cutoff_time=20,
    )
    assert len(feature_matrix.index) == 5
    assert len(feature_matrix.columns) == len(features)


def test_accepts_no_cutoff_time(dataframes, relationships):
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        instance_ids=[1, 2, 3, 5, 6],
    )
    assert len(feature_matrix.index) == 5
    assert len(feature_matrix.columns) == len(features)


def test_ignores_instance_ids_if_cutoff_df(dataframes, relationships):
    cutoff_times_df = pd.DataFrame({"instance_id": [1, 2, 3], "time": [10, 12, 15]})
    instance_ids = [1, 2, 3, 4, 5]
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        cutoff_time=cutoff_times_df,
        instance_ids=instance_ids,
    )
    assert len(feature_matrix.index) == 3
    assert len(feature_matrix.columns) == len(features)


def test_approximate_features(dataframes, relationships):
    cutoff_times_df = pd.DataFrame(
        {"instance_id": [1, 3, 1, 5, 3, 6], "time": [11, 16, 16, 26, 17, 22]},
    )
    # force column to BooleanNullable
    dataframes["transactions"] += ({"fraud": "BooleanNullable"},)
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        cutoff_time=cutoff_times_df,
        approximate=5,
        cutoff_time_in_index=True,
    )
    direct_agg_feat_name = "cards.PERCENT_TRUE(transactions.fraud)"
    assert len(feature_matrix.index) == 6
    assert len(feature_matrix.columns) == len(features)

    truth_values = pd.Series(data=[1.0, 0.5, 0.5, 1.0, 0.5, 1.0])

    assert (feature_matrix[direct_agg_feat_name] == truth_values.values).all()


def test_all_columns(dataframes, relationships):
    cutoff_times_df = pd.DataFrame({"instance_id": [1, 2, 3], "time": [10, 12, 15]})
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        cutoff_time=cutoff_times_df,
        agg_primitives=[Max, Mean, Min, Sum],
        trans_primitives=[],
        groupby_trans_primitives=["cum_sum"],
        max_depth=3,
        allowed_paths=None,
        ignore_dataframes=None,
        ignore_columns=None,
        seed_features=None,
    )
    assert len(feature_matrix.index) == 3
    assert len(feature_matrix.columns) == len(features)


def test_features_only(dataframes, relationships):
    if len(dataframes["transactions"]) > 3:
        dataframes["transactions"][3]["fraud"] = "BooleanNullable"
    else:
        dataframes["transactions"] += ({"fraud": "BooleanNullable"},)
    features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        features_only=True,
    )

    expected_features = 11
    assert len(features) == expected_features


def test_accepts_relative_training_window(datetime_es):
    feature_matrix, _ = dfs(entityset=datetime_es, target_dataframe_name="transactions")

    feature_matrix_2, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time=pd.Timestamp("2012-4-1 04:00"),
    )

    feature_matrix_3, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time=pd.Timestamp("2012-4-1 04:00"),
        training_window=Timedelta("3 months"),
    )

    feature_matrix_4, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time=pd.Timestamp("2012-4-1 04:00"),
        training_window="3 months",
    )

    assert (feature_matrix.index == [1, 2, 3, 4, 5]).all()
    assert (feature_matrix_2.index == [1, 2, 3, 4]).all()
    assert (feature_matrix_3.index == [2, 3, 4]).all()
    assert (feature_matrix_4.index == [2, 3, 4]).all()

    # Test case for leap years
    feature_matrix_5, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time=pd.Timestamp("2012-2-29 04:00"),
        training_window=Timedelta("1 year"),
        include_cutoff_time=True,
    )
    assert (feature_matrix_5.index == [2]).all()

    feature_matrix_5, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time=pd.Timestamp("2012-2-29 04:00"),
        training_window=Timedelta("1 year"),
        include_cutoff_time=False,
    )
    assert (feature_matrix_5.index == [1, 2]).all()


def test_accepts_pd_timedelta_training_window(datetime_es):
    feature_matrix, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time=pd.Timestamp("2012-3-31 04:00"),
        training_window=pd.Timedelta(61, "D"),
    )

    assert (feature_matrix.index == [2, 3, 4]).all()


def test_accepts_pd_dateoffset_training_window(datetime_es):
    feature_matrix, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time=pd.Timestamp("2012-3-31 04:00"),
        training_window=pd.DateOffset(months=2),
    )

    feature_matrix_2, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time=pd.Timestamp("2012-3-31 04:00"),
        training_window=pd.offsets.BDay(44),
    )

    assert (feature_matrix.index == [2, 3, 4]).all()
    assert (feature_matrix.index == feature_matrix_2.index).all()


def test_accepts_datetime_and_string_offset(datetime_es):
    feature_matrix, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time=pd.to_datetime("2012-3-31 04:00"),
        training_window=pd.DateOffset(months=2),
    )

    feature_matrix_2, _ = dfs(
        entityset=datetime_es,
        target_dataframe_name="transactions",
        cutoff_time="2012-3-31 04:00",
        training_window=pd.offsets.BDay(44),
    )

    assert (feature_matrix.index == [2, 3, 4]).all()
    assert (feature_matrix.index == feature_matrix_2.index).all()


def test_handles_pandas_parser_error(datetime_es):
    with pytest.raises(ValueError):
        _, _ = dfs(
            entityset=datetime_es,
            target_dataframe_name="transactions",
            cutoff_time="2--012-----3-----31 04:00",
            training_window=pd.DateOffset(months=2),
        )


def test_handles_pandas_overflow_error(datetime_es):
    # pandas 1.5.0 raises ValueError, older versions raised OverflowError
    with pytest.raises((OverflowError, ValueError)):
        _, _ = dfs(
            entityset=datetime_es,
            target_dataframe_name="transactions",
            cutoff_time="200000000000000000000000000000000000000000000000000000000000000000-3-31 04:00",
            training_window=pd.DateOffset(months=2),
        )


def test_warns_with_unused_primitives(es):
    trans_primitives = ["num_characters", "num_words", "add_numeric"]
    agg_primitives = [Max, "min"]

    warning_text = (
        "Some specified primitives were not used during DFS:\n"
        + "  trans_primitives: ['add_numeric']\n  agg_primitives: ['max', 'min']\n"
        + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, "
        + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call "
        + "contained multiple instances of a primitive in the list above, none of them were used."
    )

    with pytest.warns(UnusedPrimitiveWarning) as record:
        dfs(
            entityset=es,
            target_dataframe_name="customers",
            trans_primitives=trans_primitives,
            agg_primitives=agg_primitives,
            max_depth=1,
            features_only=True,
        )

    assert record[0].message.args[0] == warning_text

    # Should not raise a warning
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        dfs(
            entityset=es,
            target_dataframe_name="customers",
            trans_primitives=trans_primitives,
            agg_primitives=agg_primitives,
            max_depth=2,
            features_only=True,
        )


def test_no_warns_with_camel_and_title_case(es):
    for trans_primitive in ["isNull", "IsNull"]:
        # Should not raise an UnusedPrimitiveWarning
        with warnings.catch_warnings():
            warnings.simplefilter("error")
            dfs(
                entityset=es,
                target_dataframe_name="customers",
                trans_primitives=[trans_primitive],
                max_depth=1,
                features_only=True,
            )

    for agg_primitive in ["numUnique", "NumUnique"]:
        # Should not raise an UnusedPrimitiveWarning
        with warnings.catch_warnings():
            warnings.simplefilter("error")
            dfs(
                entityset=es,
                target_dataframe_name="customers",
                agg_primitives=[agg_primitive],
                max_depth=2,
                features_only=True,
            )


def test_does_not_warn_with_stacking_feature(es):
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        dfs(
            entityset=es,
            target_dataframe_name="régions",
            agg_primitives=["percent_true"],
            trans_primitives=[GreaterThanScalar(5)],
            primitive_options={
                "greater_than_scalar": {"include_dataframes": ["stores"]},
            },
            features_only=True,
        )


def test_warns_with_unused_where_primitives(es):
    warning_text = (
        "Some specified primitives were not used during DFS:\n"
        + "  where_primitives: ['count', 'sum']\n"
        + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, "
        + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call "
        + "contained multiple instances of a primitive in the list above, none of them were used."
    )

    with pytest.warns(UnusedPrimitiveWarning) as record:
        dfs(
            entityset=es,
            target_dataframe_name="customers",
            agg_primitives=["count"],
            where_primitives=["sum", "count"],
            max_depth=1,
            features_only=True,
        )

    assert record[0].message.args[0] == warning_text


def test_warns_with_unused_groupby_primitives(es):
    warning_text = (
        "Some specified primitives were not used during DFS:\n"
        + "  groupby_trans_primitives: ['cum_sum']\n"
        + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, "
        + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call "
        + "contained multiple instances of a primitive in the list above, none of them were used."
    )

    with pytest.warns(UnusedPrimitiveWarning) as record:
        dfs(
            entityset=es,
            target_dataframe_name="sessions",
            groupby_trans_primitives=["cum_sum"],
            max_depth=1,
            features_only=True,
        )

    assert record[0].message.args[0] == warning_text

    # Should not raise a warning
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        dfs(
            entityset=es,
            target_dataframe_name="customers",
            groupby_trans_primitives=["cum_sum"],
            max_depth=1,
            features_only=True,
        )


def test_warns_with_unused_custom_primitives(es):
    class AboveTen(TransformPrimitive):
        name = "above_ten"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

    trans_primitives = [AboveTen]

    warning_text = (
        "Some specified primitives were not used during DFS:\n"
        + "  trans_primitives: ['above_ten']\n"
        + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, "
        + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call "
        + "contained multiple instances of a primitive in the list above, none of them were used."
    )

    with pytest.warns(UnusedPrimitiveWarning) as record:
        dfs(
            entityset=es,
            target_dataframe_name="sessions",
            trans_primitives=trans_primitives,
            max_depth=1,
            features_only=True,
        )

    assert record[0].message.args[0] == warning_text

    # Should not raise a warning
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        dfs(
            entityset=es,
            target_dataframe_name="customers",
            trans_primitives=trans_primitives,
            max_depth=1,
            features_only=True,
        )

    class MaxAboveTen(AggregationPrimitive):
        name = "max_above_ten"
        input_types = [ColumnSchema(semantic_tags={"numeric"})]
        return_type = ColumnSchema(semantic_tags={"numeric"})

    agg_primitives = [MaxAboveTen]

    warning_text = (
        "Some specified primitives were not used during DFS:\n"
        + "  agg_primitives: ['max_above_ten']\n"
        + "This may be caused by a using a value of max_depth that is too small, not setting interesting values, "
        + "or it may indicate no compatible columns for the primitive were found in the data. If the DFS call "
        + "contained multiple instances of a primitive in the list above, none of them were used."
    )

    with pytest.warns(UnusedPrimitiveWarning) as record:
        dfs(
            entityset=es,
            target_dataframe_name="stores",
            agg_primitives=agg_primitives,
            max_depth=1,
            features_only=True,
        )

    assert record[0].message.args[0] == warning_text

    # Should not raise a warning
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        dfs(
            entityset=es,
            target_dataframe_name="sessions",
            agg_primitives=agg_primitives,
            max_depth=1,
            features_only=True,
        )


def test_calls_progress_callback(dataframes, relationships):
    class MockProgressCallback:
        def __init__(self):
            self.progress_history = []
            self.total_update = 0
            self.total_progress_percent = 0

        def __call__(self, update, progress_percent, time_elapsed):
            self.total_update += update
            self.total_progress_percent = progress_percent
            self.progress_history.append(progress_percent)

    mock_progress_callback = MockProgressCallback()

    dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        progress_callback=mock_progress_callback,
    )

    # second to last entry is the last update from feature calculation
    assert np.isclose(
        mock_progress_callback.progress_history[-2],
        FEATURE_CALCULATION_PERCENTAGE * 100,
    )
    assert np.isclose(mock_progress_callback.total_update, 100.0)
    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)


def test_calls_progress_callback_cluster(dataframes, relationships, dask_cluster):
    class MockProgressCallback:
        def __init__(self):
            self.progress_history = []
            self.total_update = 0
            self.total_progress_percent = 0

        def __call__(self, update, progress_percent, time_elapsed):
            self.total_update += update
            self.total_progress_percent = progress_percent
            self.progress_history.append(progress_percent)

    mock_progress_callback = MockProgressCallback()

    dkwargs = {"cluster": dask_cluster.scheduler.address}
    dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        progress_callback=mock_progress_callback,
        dask_kwargs=dkwargs,
    )

    assert np.isclose(mock_progress_callback.total_update, 100.0)
    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)


def test_dask_kwargs(dataframes, relationships, dask_cluster):
    cutoff_times_df = pd.DataFrame({"instance_id": [1, 2, 3], "time": [10, 12, 15]})
    feature_matrix, features = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        cutoff_time=cutoff_times_df,
    )

    dask_kwargs = {"cluster": dask_cluster.scheduler.address}
    feature_matrix_2, features_2 = dfs(
        dataframes=dataframes,
        relationships=relationships,
        target_dataframe_name="transactions",
        cutoff_time=cutoff_times_df,
        dask_kwargs=dask_kwargs,
    )

    assert all(
        f1.unique_name() == f2.unique_name() for f1, f2 in zip(features, features_2)
    )
    for column in feature_matrix:
        for x, y in zip(feature_matrix[column], feature_matrix_2[column]):
            assert (pd.isnull(x) and pd.isnull(y)) or (x == y)


================================================
FILE: featuretools/tests/synthesis/test_encode_features.py
================================================
import pandas as pd
import pytest

from featuretools import EntitySet, calculate_feature_matrix, dfs
from featuretools.feature_base import Feature, IdentityFeature
from featuretools.primitives import NMostCommon
from featuretools.synthesis import encode_features


def test_encodes_features(es):
    f1 = IdentityFeature(es["log"].ww["product_id"])
    f2 = IdentityFeature(es["log"].ww["purchased"])
    f3 = IdentityFeature(es["log"].ww["value"])

    features = [f1, f2, f3]
    feature_matrix = calculate_feature_matrix(
        features,
        es,
        instance_ids=[0, 1, 2, 3, 4, 5],
    )

    _, features_encoded = encode_features(feature_matrix, features)
    assert len(features_encoded) == 6

    _, features_encoded = encode_features(feature_matrix, features, top_n=2)
    assert len(features_encoded) == 5

    _, features_encoded = encode_features(
        feature_matrix,
        features,
        include_unknown=False,
    )
    assert len(features_encoded) == 5


def test_inplace_encodes_features(es):
    f1 = IdentityFeature(es["log"].ww["product_id"])

    features = [f1]
    feature_matrix = calculate_feature_matrix(
        features,
        es,
        instance_ids=[0, 1, 2, 3, 4, 5],
    )

    feature_matrix_shape = feature_matrix.shape
    feature_matrix_encoded, _ = encode_features(feature_matrix, features)
    assert feature_matrix_encoded.shape != feature_matrix_shape
    assert feature_matrix.shape == feature_matrix_shape

    # with inplace=True the input matrix itself is encoded, so both shapes should match
    feature_matrix_encoded, _ = encode_features(feature_matrix, features, inplace=True)
    assert feature_matrix_encoded.shape == feature_matrix.shape


def test_to_encode_features(es):
    f1 = IdentityFeature(es["log"].ww["product_id"])
    f2 = IdentityFeature(es["log"].ww["value"])
    f3 = IdentityFeature(es["log"].ww["datetime"])

    features = [f1, f2, f3]
    feature_matrix = calculate_feature_matrix(
        features,
        es,
        instance_ids=[0, 1, 2, 3, 4, 5],
    )

    feature_matrix_encoded, _ = encode_features(feature_matrix, features)
    feature_matrix_encoded_shape = feature_matrix_encoded.shape

    # with an empty to_encode, product_id should stay categorical and datetime should stay a datetime;
    # the shape should differ from the previous encoded matrix because fewer features are encoded
    to_encode = []
    feature_matrix_encoded, _ = encode_features(
        feature_matrix,
        features,
        to_encode=to_encode,
    )
    assert feature_matrix_encoded_shape != feature_matrix_encoded.shape
    assert feature_matrix_encoded["datetime"].dtype == "datetime64[ns]"
    assert feature_matrix_encoded["product_id"].dtype == "category"

    to_encode = ["value"]
    feature_matrix_encoded, _ = encode_features(
        feature_matrix,
        features,
        to_encode=to_encode,
    )
    assert feature_matrix_encoded_shape != feature_matrix_encoded.shape
    assert feature_matrix_encoded["datetime"].dtype == "datetime64[ns]"
    assert feature_matrix_encoded["product_id"].dtype == "category"


def test_encode_features_handles_pass_columns(es):
    f1 = IdentityFeature(es["log"].ww["product_id"])
    f2 = IdentityFeature(es["log"].ww["value"])

    features = [f1, f2]
    cutoff_time = pd.DataFrame(
        {
            "instance_id": range(6),
            "time": es["log"]["datetime"][0:6],
            "label": [i % 2 for i in range(6)],
        },
        columns=["instance_id", "time", "label"],
    )
    feature_matrix = calculate_feature_matrix(features, es, cutoff_time)

    assert "label" in feature_matrix.columns

    feature_matrix_encoded, _ = encode_features(feature_matrix, features)
    feature_matrix_encoded_shape = feature_matrix_encoded.shape

    # with an empty to_encode, product_id should stay unencoded and the 3 additional columns should not be created
    to_encode = []
    feature_matrix_encoded, _ = encode_features(
        feature_matrix,
        features,
        to_encode=to_encode,
    )
    assert feature_matrix_encoded_shape != feature_matrix_encoded.shape

    to_encode = ["value"]
    feature_matrix_encoded, _ = encode_features(
        feature_matrix,
        features,
        to_encode=to_encode,
    )
    assert feature_matrix_encoded_shape != feature_matrix_encoded.shape

    assert "label" in feature_matrix_encoded.columns


def test_encode_features_catches_features_mismatch(es):
    f1 = IdentityFeature(es["log"].ww["product_id"])
    f2 = IdentityFeature(es["log"].ww["value"])
    f3 = IdentityFeature(es["log"].ww["session_id"])

    features = [f1, f2]
    cutoff_time = pd.DataFrame(
        {
            "instance_id": range(6),
            "time": es["log"]["datetime"][0:6],
            "label": [i % 2 for i in range(6)],
        },
        columns=["instance_id", "time", "label"],
    )
    feature_matrix = calculate_feature_matrix(features, es, cutoff_time)

    assert "label" in feature_matrix.columns

    error_text = "Feature session_id not found in feature matrix"
    with pytest.raises(AssertionError, match=error_text):
        encode_features(feature_matrix, [f1, f3])


def test_encode_unknown_features():
    # Dataframe with categorical column with "unknown" string
    df = pd.DataFrame({"category": ["unknown", "b", "c", "d", "e"]}).astype(
        {"category": "category"},
    )

    es = EntitySet("test")
    es.add_dataframe(
        dataframe_name="a",
        dataframe=df,
        index="index",
        make_index=True,
    )
    features, feature_defs = dfs(
        entityset=es,
        target_dataframe_name="a",
        max_depth=1,
    )

    # The literal "unknown" category should stay distinct from the added "is unknown" column
    features_enc, _ = encode_features(features, feature_defs, include_unknown=True)
    assert list(features_enc.columns) == [
        "category = unknown",
        "category = e",
        "category = d",
        "category = c",
        "category = b",
        "category is unknown",
    ]


def test_encode_features_topn(es):
    topn = Feature(
        Feature(es["log"].ww["product_id"]),
        parent_dataframe_name="customers",
        primitive=NMostCommon(n=3),
    )
    features, feature_defs = dfs(
        entityset=es,
        instance_ids=[0, 1, 2],
        target_dataframe_name="customers",
        agg_primitives=[NMostCommon(n=3)],
    )
    features_enc, feature_defs_enc = encode_features(
        features,
        feature_defs,
        include_unknown=True,
    )
    assert topn.unique_name() in [feat.unique_name() for feat in feature_defs_enc]
    for name in topn.get_feature_names():
        assert name in features_enc.columns
        assert features_enc.columns.tolist().count(name) == 1


def test_encode_features_drop_first():
    df = pd.DataFrame({"category": ["ao", "b", "c", "d", "e"]}).astype(
        {"category": "category"},
    )
    es = EntitySet("test")
    es.add_dataframe(
        dataframe_name="a",
        dataframe=df,
        index="index",
        make_index=True,
    )
    features, feature_defs = dfs(
        entityset=es,
        target_dataframe_name="a",
        max_depth=1,
    )
    features_enc, _ = encode_features(
        features,
        feature_defs,
        drop_first=True,
        include_unknown=False,
    )
    assert len(features_enc.columns) == 4

    features_enc, feature_defs = encode_features(
        features,
        feature_defs,
        top_n=3,
        drop_first=True,
        include_unknown=False,
    )

    assert len(features_enc.columns) == 2


def test_encode_features_handles_dictionary_input(es):
    f1 = IdentityFeature(es["log"].ww["product_id"])
    f2 = IdentityFeature(es["log"].ww["purchased"])
    f3 = IdentityFeature(es["log"].ww["session_id"])

    features = [f1, f2, f3]
    feature_matrix = calculate_feature_matrix(features, es, instance_ids=range(16))
    feature_matrix_encoded, features_encoded = encode_features(feature_matrix, features)
    true_values = [
        "product_id = coke zero",
        "product_id = toothpaste",
        "product_id = car",
        "product_id = brown bag",
        "product_id = taco clock",
        "product_id = Haribo sugar-free gummy bears",
        "product_id is unknown",
        "purchased",
        "session_id = 0",
        "session_id = 1",
        "session_id = 4",
        "session_id = 3",
        "session_id = 5",
        "session_id = 2",
        "session_id is unknown",
    ]
    assert len(features_encoded) == 15
    for col in true_values:
        assert col in list(feature_matrix_encoded.columns)

    top_n_dict = {}
    feature_matrix_encoded, features_encoded = encode_features(
        feature_matrix,
        features,
        top_n=top_n_dict,
    )
    assert len(features_encoded) == 15
    for col in true_values:
        assert col in list(feature_matrix_encoded.columns)

    top_n_dict = {f1.get_name(): 4, f3.get_name(): 3}
    feature_matrix_encoded, features_encoded = encode_features(
        feature_matrix,
        features,
        top_n=top_n_dict,
    )
    assert len(features_encoded) == 10
    true_values = [
        "product_id = coke zero",
        "product_id = toothpaste",
        "product_id = car",
        "product_id = brown bag",
        "product_id is unknown",
        "purchased",
        "session_id = 0",
        "session_id = 1",
        "session_id = 4",
        "session_id is unknown",
    ]
    for col in true_values:
        assert col in list(feature_matrix_encoded.columns)

    feature_matrix_encoded, features_encoded = encode_features(
        feature_matrix,
        features,
        top_n=top_n_dict,
        include_unknown=False,
    )
    true_values = [
        "product_id = coke zero",
        "product_id = toothpaste",
        "product_id = car",
        "product_id = brown bag",
        "purchased",
        "session_id = 0",
        "session_id = 1",
        "session_id = 4",
    ]
    assert len(features_encoded) == 8
    for col in true_values:
        assert col in list(feature_matrix_encoded.columns)


def test_encode_features_matches_calculate_feature_matrix():
    df = pd.DataFrame({"category": ["b", "c", "d", "e"]}).astype(
        {"category": "category"},
    )

    es = EntitySet("test")
    es.add_dataframe(
        dataframe_name="a",
        dataframe=df,
        index="index",
        make_index=True,
    )
    features, feature_defs = dfs(
        entityset=es,
        target_dataframe_name="a",
        max_depth=1,
    )

    features_enc, feature_defs_enc = encode_features(
        features,
        feature_defs,
        to_encode=["category"],
    )

    features_calc = calculate_feature_matrix(feature_defs_enc, entityset=es)

    pd.testing.assert_frame_equal(features_enc, features_calc)
    assert features_calc.ww._schema == features_enc.ww._schema


================================================
FILE: featuretools/tests/synthesis/test_get_valid_primitives.py
================================================
import pytest
from woodwork.column_schema import ColumnSchema

from featuretools.primitives import (
    AggregationPrimitive,
    Count,
    Hour,
    IsIn,
    Not,
    TimeSincePrevious,
    TransformPrimitive,
)
from featuretools.synthesis.get_valid_primitives import get_valid_primitives


def test_get_valid_primitives_selected_primitives(es):
    agg_prims, trans_prims = get_valid_primitives(
        es,
        "log",
        selected_primitives=[Hour, Count],
    )
    assert set(agg_prims) == set([Count])
    assert set(trans_prims) == set([Hour])

    agg_prims, trans_prims = get_valid_primitives(
        es,
        "products",
        selected_primitives=[Hour],
        max_depth=1,
    )
    assert set(agg_prims) == set()
    assert set(trans_prims) == set()


def test_get_valid_primitives_selected_primitives_strings(es):
    agg_prims, trans_prims = get_valid_primitives(
        es,
        "log",
        selected_primitives=["hour", "count"],
    )
    assert set(agg_prims) == set([Count])
    assert set(trans_prims) == set([Hour])

    agg_prims, trans_prims = get_valid_primitives(
        es,
        "products",
        selected_primitives=["hour"],
        max_depth=1,
    )
    assert set(agg_prims) == set()
    assert set(trans_prims) == set()


def test_invalid_primitive(es):
    with pytest.raises(ValueError, match="'foobar' is not a recognized primitive name"):
        get_valid_primitives(
            es,
            target_dataframe_name="log",
            selected_primitives=["foobar"],
        )

    msg = (
        "Selected primitive <class 'woodwork.column_schema.ColumnSchema'> "
        "is not an AggregationPrimitive, TransformPrimitive, or str"
    )
    with pytest.raises(ValueError, match=msg):
        get_valid_primitives(
            es,
            target_dataframe_name="log",
            selected_primitives=[ColumnSchema],
        )


def test_primitive_compatibility(es):
    _, trans_prims = get_valid_primitives(
        es,
        "customers",
        selected_primitives=[TimeSincePrevious],
    )
    assert len(trans_prims) == 1


def test_get_valid_primitives_custom_primitives(es):
    class ThreeMostCommonCat(AggregationPrimitive):
        name = "n_most_common_categorical"
        input_types = [ColumnSchema(semantic_tags={"category"})]
        return_type = ColumnSchema(semantic_tags={"category"})
        number_output_features = 3

    class AddThree(TransformPrimitive):
        name = "add_three"
        input_types = [
            ColumnSchema(semantic_tags="numeric"),
            ColumnSchema(semantic_tags="numeric"),
            ColumnSchema(semantic_tags="numeric"),
        ]
        return_type = ColumnSchema(semantic_tags="numeric")
        commutative = True

    agg_prims, trans_prims = get_valid_primitives(es, "log")
    assert ThreeMostCommonCat not in agg_prims
    assert AddThree not in trans_prims

    with pytest.raises(
        ValueError,
        match="'add_three' is not a recognized primitive name",
    ):
        agg_prims, trans_prims = get_valid_primitives(
            es,
            "log",
            2,
            [ThreeMostCommonCat, "add_three"],
        )


def test_get_valid_primitives_all_primitives(es):
    agg_prims, trans_prims = get_valid_primitives(es, "customers")
    assert Count in agg_prims
    assert Hour in trans_prims


def test_get_valid_primitives_single_table(transform_es):
    msg = "Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created"
    with pytest.warns(UserWarning, match=msg):
        agg_prims, trans_prims = get_valid_primitives(transform_es, "first")

    assert set(agg_prims) == set()
    assert IsIn in trans_prims


def test_get_valid_primitives_with_dfs_kwargs(es):
    agg_prims, trans_prims = get_valid_primitives(
        es,
        "customers",
        selected_primitives=[Hour, Count, Not],
    )
    assert set(agg_prims) == set([Count])
    assert set(trans_prims) == set([Hour, Not])

    # Can use other dfs parameters and they get applied
    agg_prims, trans_prims = get_valid_primitives(
        es,
        "customers",
        selected_primitives=[Hour, Count, Not],
        ignore_columns={"customers": ["loves_ice_cream"]},
    )
    assert set(agg_prims) == set([Count])
    assert set(trans_prims) == set([Hour])

    agg_prims, trans_prims = get_valid_primitives(
        es,
        "products",
        selected_primitives=[Hour, Count],
        ignore_dataframes=["log"],
    )
    assert set(agg_prims) == set()
    assert set(trans_prims) == set()


================================================
FILE: featuretools/tests/test_version.py
================================================
from featuretools import __version__


def test_version():
    assert __version__ == "1.31.0"


================================================
FILE: featuretools/tests/testing_utils/__init__.py
================================================
# flake8: noqa
from featuretools.tests.testing_utils.cluster import (
    MockClient,
    mock_cluster,
    get_mock_client_cluster,
)
from featuretools.tests.testing_utils.es_utils import get_df_tags
from featuretools.tests.testing_utils.features import (
    feature_with_name,
    number_of_features_with_name_like,
    backward_path,
    forward_path,
    check_rename,
    check_names,
)
from featuretools.tests.testing_utils.mock_ds import make_ecommerce_entityset


================================================
FILE: featuretools/tests/testing_utils/cluster.py
================================================
from psutil import virtual_memory


def mock_cluster(
    n_workers=1,
    threads_per_worker=1,
    diagnostics_port=8787,
    memory_limit=None,
    **dask_kwarg,
):
    return (n_workers, threads_per_worker, diagnostics_port, memory_limit)


class MockClient:
    def __init__(self, cluster):
        self.cluster = cluster

    def scheduler_info(self):
        return {"workers": {"worker 1": {"memory_limit": virtual_memory().total}}}


def get_mock_client_cluster():
    return MockClient, mock_cluster


================================================
FILE: featuretools/tests/testing_utils/es_utils.py
================================================
def get_df_tags(df):
    """Gets a DataFrame's semantic tags without index or time index tags for Woodwork init"""
    semantic_tags = {}
    for col_name in df.columns:
        semantic_tags[col_name] = df.ww.semantic_tags[col_name] - {
            "time_index",
            "index",
        }

    return semantic_tags


================================================
FILE: featuretools/tests/testing_utils/features.py
================================================
import re

from featuretools.entityset.relationship import RelationshipPath


def feature_with_name(features, name):
    for f in features:
        if f.get_name() == name:
            return True
    return False


def number_of_features_with_name_like(features, pattern):
    """Returns number of features with names that contain the provided pattern as a literal string (regex metacharacters are escaped)"""
    pattern = re.compile(re.escape(pattern))
    names = [f.get_name() for f in features]
    return len([name for name in names if pattern.search(name)])
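

# Illustrative sketch, not part of the original test utilities: the helper above
# passes `pattern` through re.escape, so it counts literal substring matches, not
# regex matches. `_example_literal_vs_regex_counts` and its sample names are
# hypothetical and exist only to show the difference.
def _example_literal_vs_regex_counts():
    import re  # local import keeps the sketch self-contained

    names = ["SUM(log.value)", "SUM(log_value)", "COUNT(log)"]
    literal = re.compile(re.escape("log.value"))  # "." must match a real dot
    raw = re.compile("log.value")  # "." matches any character, including "_"
    literal_count = len([name for name in names if literal.search(name)])
    raw_count = len([name for name in names if raw.search(name)])
    return literal_count, raw_count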


def backward_path(es, dataframe_ids):
    """
    Create a backward RelationshipPath through the given dataframes. Assumes only
    one such path is possible.
    """

    def _get_relationship(child, parent):
        return next(
            r
            for r in es.get_forward_relationships(child)
            if r._parent_dataframe_name == parent
        )

    relationships = [
        _get_relationship(child, parent)
        for parent, child in zip(dataframe_ids[:-1], dataframe_ids[1:])
    ]

    return RelationshipPath([(False, r) for r in relationships])


def forward_path(es, dataframe_ids):
    """
    Create a forward RelationshipPath through the given dataframes. Assumes only
    one such path is possible.
    """

    def _get_relationship(child, parent):
        return next(
            r
            for r in es.get_forward_relationships(child)
            if r._parent_dataframe_name == parent
        )

    relationships = [
        _get_relationship(child, parent)
        for child, parent in zip(dataframe_ids[:-1], dataframe_ids[1:])
    ]

    return RelationshipPath([(True, r) for r in relationships])
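

# Illustrative sketch, not part of the original test utilities: both path helpers
# above walk consecutive pairs of dataframe names. forward_path reads each pair as
# (child, parent), while backward_path reads it as (parent, child).
# `_example_consecutive_pairs` is a hypothetical helper that shows only the
# pairing step; it does not touch an EntitySet.
def _example_consecutive_pairs(dataframe_ids):
    """Return the consecutive name pairs produced by zipping the list with itself shifted by one."""
    return list(zip(dataframe_ids[:-1], dataframe_ids[1:]))


# For a forward path such as ["log", "sessions", "customers"], the pairs are
# ("log", "sessions") and ("sessions", "customers"), each read as (child, parent).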


def check_rename(feat, new_name, new_names):
    copy_feat = feat.rename(new_name)
    assert feat.unique_name() != copy_feat.unique_name()
    assert feat.get_name() != copy_feat.get_name()
    assert (
        feat.base_features[0].generate_name()
        == copy_feat.base_features[0].generate_name()
    )
    assert feat.dataframe_name == copy_feat.dataframe_name
    assert feat.get_feature_names() != copy_feat.get_feature_names()
    check_names(copy_feat, new_name, new_names)


def check_names(feat, new_name, new_names):
    assert feat.get_name() == new_name
    assert feat.get_feature_names() == new_names


================================================
FILE: featuretools/tests/testing_utils/generate_fake_dataframe.py
================================================
import random
from datetime import datetime as dt

import pandas as pd
import woodwork.type_sys.type_system as ww_type_system
from woodwork import logical_types

from featuretools.feature_discovery.utils import flatten_list

logical_type_mapping = {
    logical_types.Boolean.__name__: [True, False],
    logical_types.BooleanNullable.__name__: [True, False, pd.NA],
    logical_types.Categorical.__name__: ["A", "B", "C"],
    logical_types.Datetime.__name__: [
        dt(2020, 1, 1, 12, 0, 0),
        dt(2020, 6, 1, 12, 0, 0),
    ],
    logical_types.Double.__name__: [1.2, 2.3, 3.4],
    logical_types.Integer.__name__: [1, 2, 3],
    logical_types.IntegerNullable.__name__: [1, 2, 3, pd.NA],
    logical_types.EmailAddress.__name__: [
        "john.smith@example.com",
        "sally.jones@example.com",
    ],
    logical_types.LatLong.__name__: [(1, 2), (3, 4)],
    logical_types.NaturalLanguage.__name__: [
        "This is sentence 1",
        "This is sentence 2",
    ],
    logical_types.Ordinal.__name__: [1, 2, 3],
    logical_types.URL.__name__: ["https://www.example.com", "https://www.example2.com"],
    logical_types.PostalCode.__name__: ["60018", "60018-0123"],
}


def generate_fake_dataframe(
    col_defs=[("f_1", "Numeric"), ("f_2", "Datetime", "time_index")],
    n_rows=10,
    df_name="df",
):
    def randomize(values_):
        random.seed(10)
        values = values_.copy()
        random.shuffle(values)
        return values

    def gen_series(values):
        values = [values] * n_rows
        if isinstance(values, list):
            values = flatten_list(values)

        return randomize(values)[:n_rows]

    def get_tags(lt, tags=set()):
        inferred_tags = ww_type_system.str_to_logical_type(lt).standard_tags
        assert isinstance(inferred_tags, set)
        return inferred_tags.union(tags) - {"index", "time_index"}

    other_kwargs = {}

    df = pd.DataFrame()
    lt_dict = {}
    tags_dict = {}
    for name, lt_name, *rest in col_defs:
        if lt_name in logical_type_mapping:
            values = logical_type_mapping[lt_name]
            if lt_name == logical_types.Ordinal.__name__:
                lt = logical_types.Ordinal(order=values)
            else:
                lt = lt_name
            values = gen_series(values)
        else:
            raise Exception(f"Unknown logical type {lt_name}")

        lt_dict[name] = lt

        if len(rest):
            tags = rest[0]
            if "index" in tags:
                other_kwargs["index"] = name
                values = range(n_rows)
            if "time_index" in tags:
                other_kwargs["time_index"] = name
                values = pd.date_range("2000-01-01", periods=n_rows)
            tags_dict[name] = get_tags(lt_name, tags)
        else:
            tags_dict[name] = get_tags(lt_name)

        s = pd.Series(values, name=name)
        df = pd.concat([df, s], axis=1)

    df.ww.init(
        name=df_name,
        logical_types=lt_dict,
        semantic_tags=tags_dict,
        **other_kwargs,
    )

    return df


================================================
FILE: featuretools/tests/testing_utils/mock_ds.py
================================================
from datetime import datetime

import numpy as np
import pandas as pd
from woodwork.logical_types import (
    URL,
    Boolean,
    Categorical,
    CountryCode,
    Datetime,
    Double,
    EmailAddress,
    Filepath,
    Integer,
    IPAddress,
    LatLong,
    NaturalLanguage,
    Ordinal,
    PersonFullName,
    PhoneNumber,
    PostalCode,
    SubRegionCode,
)

from featuretools.entityset import EntitySet


def make_ecommerce_entityset(with_integer_time_index=False):
    """Makes an entityset with the following shape:

      R         Régions
     / \\       .
    S   C       Stores, Customers
        |       .
        S   P   Sessions, Products
         \\ /   .
          L     Log
    """
    dataframes = make_ecommerce_dataframes(
        with_integer_time_index=with_integer_time_index,
    )
    dataframe_names = dataframes.keys()
    es_id = "ecommerce"
    if with_integer_time_index:
        es_id += "_int_time_index"

    logical_types = make_logical_types(with_integer_time_index=with_integer_time_index)
    semantic_tags = make_semantic_tags()
    time_indexes = make_time_indexes(with_integer_time_index=with_integer_time_index)

    es = EntitySet(id=es_id)

    for df_name in dataframe_names:
        time_index = time_indexes.get(df_name, None)
        ti_name = None
        secondary = None
        if time_index is not None:
            ti_name = time_index["name"]
            secondary = time_index["secondary"]
        df = dataframes[df_name]
        es.add_dataframe(
            df,
            dataframe_name=df_name,
            index="id",
            logical_types=logical_types[df_name],
            semantic_tags=semantic_tags[df_name],
            time_index=ti_name,
            secondary_time_index=secondary,
        )

    es.normalize_dataframe(
        "customers",
        "cohorts",
        "cohort",
        additional_columns=["cohort_name"],
        make_time_index=True,
        new_dataframe_time_index="cohort_end",
    )

    es.add_relationships(
        [
            ("régions", "id", "customers", "région_id"),
            ("régions", "id", "stores", "région_id"),
            ("customers", "id", "sessions", "customer_id"),
            ("sessions", "id", "log", "session_id"),
            ("products", "id", "log", "product_id"),
        ],
    )

    return es


def make_ecommerce_dataframes(with_integer_time_index=False):
    region_df = pd.DataFrame(
        {"id": ["United States", "Mexico"], "language": ["en", "sp"]},
    )

    store_df = pd.DataFrame(
        {
            "id": range(6),
            "région_id": ["United States"] * 3 + ["Mexico"] * 2 + [np.nan],
            "num_square_feet": list(range(30000, 60000, 6000)) + [np.nan],
        },
    )

    product_df = pd.DataFrame(
        {
            "id": [
                "Haribo sugar-free gummy bears",
                "car",
                "toothpaste",
                "brown bag",
                "coke zero",
                "taco clock",
            ],
            "department": [
                "food",
                "electronics",
                "health",
                "food",
                "food",
                "electronics",
            ],
            "rating": [3.5, 4.0, 4.5, 1.5, 5.0, 5.0],
            "url": [
                "google.com",
                "https://www.featuretools.com/",
                "amazon.com",
                "www.featuretools.com",
                "bit.ly",
                "featuretools.com/demos/",
            ],
        },
    )
    customer_times = {
        "signup_date": [
            datetime(2011, 4, 8),
            datetime(2011, 4, 9),
            datetime(2011, 4, 6),
        ],
        # some point after signup date
        "upgrade_date": [
            datetime(2011, 4, 10),
            datetime(2011, 4, 11),
            datetime(2011, 4, 7),
        ],
        "cancel_date": [
            datetime(2011, 6, 8),
            datetime(2011, 10, 9),
            datetime(2012, 1, 6),
        ],
        "birthday": [datetime(1993, 3, 8), datetime(1926, 8, 2), datetime(1993, 4, 20)],
    }
    if with_integer_time_index:
        customer_times["signup_date"] = [6, 7, 4]
        customer_times["upgrade_date"] = [18, 26, 5]
        customer_times["cancel_date"] = [27, 28, 29]
        customer_times["birthday"] = [2, 1, 3]

    customer_df = pd.DataFrame(
        {
            "id": pd.Categorical([0, 1, 2]),
            "age": [33, 25, 56],
            "région_id": ["United States"] * 3,
            "cohort": [0, 1, 0],
            "cohort_name": ["Early Adopters", "Late Adopters", "Early Adopters"],
            "loves_ice_cream": [True, False, True],
            "favorite_quote": [
                "The proletariat have nothing to lose but their chains",
                "Capitalism deprives us all of self-determination",
                "All members of the working classes must seize the "
                "means of production.",
            ],
            "signup_date": customer_times["signup_date"],
            # some point after signup date
            "upgrade_date": customer_times["upgrade_date"],
            "cancel_date": customer_times["cancel_date"],
            "cancel_reason": ["reason_1", "reason_2", "reason_1"],
            "engagement_level": [1, 3, 2],
            "full_name": ["Mr. John Doe", "Doe, Mrs. Jane", "James Brown"],
            "email": ["john.smith@example.com", np.nan, "team@featuretools.com"],
            "phone_number": ["555-555-5555", "555-555-5555", "1-(555)-555-5555"],
            "birthday": customer_times["birthday"],
        },
    )

    ips = [
        "192.168.0.1",
        "2001:4860:4860::8888",
        "0.0.0.0",
        "192.168.1.1:2869",
        np.nan,
        np.nan,
    ]
    filepaths = [
        "/home/user/docs/Letter.txt",
        "./inthisdir",
        "C:\\user\\docs\\Letter.txt",
        "~/.rcinfo",
        "../../greatgrandparent",
        "data.json",
    ]

    session_df = pd.DataFrame(
        {
            "id": [0, 1, 2, 3, 4, 5],
            "customer_id": pd.Categorical([0, 0, 0, 1, 1, 2]),
            "device_type": [0, 1, 1, 0, 0, 1],
            "device_name": ["PC", "Mobile", "Mobile", "PC", "PC", "Mobile"],
            "ip": ips,
            "filepath": filepaths,
        },
    )

    times = list(
        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]
        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]
        + [datetime(2011, 4, 9, 10, 40, 0)]
        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]
        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]
        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],
    )
    if with_integer_time_index:
        times = list(range(8, 18)) + list(range(19, 26))

    values = list(
        [i * 5 for i in range(5)]
        + [i * 1 for i in range(4)]
        + [0]
        + [i * 5 for i in range(2)]
        + [i * 7 for i in range(3)]
        + [np.nan] * 2,
    )

    values_2 = list(
        [i * 2 for i in range(5)]
        + [i * 1 for i in range(4)]
        + [0]
        + [i * 2 for i in range(2)]
        + [i * 3 for i in range(3)]
        + [np.nan] * 2,
    )

    values_many_nans = list(
        [np.nan] * 5
        + [i * 1 for i in range(4)]
        + [0]
        + [np.nan] * 2
        + [i * 3 for i in range(3)]
        + [np.nan] * 2,
    )

    latlong = list([(values[i], values_2[i]) for i, _ in enumerate(values)])
    latlong2 = list([(values_2[i], -values[i]) for i, _ in enumerate(values)])
    zipcodes = list(
        ["02116"] * 5
        + ["02116-3899"] * 4
        + ["0"]
        + ["1234567890"] * 2
        + ["12345-6789"] * 2
        + [np.nan] * 3,
    )
    countrycodes = list(["US"] * 5 + ["AL"] * 4 + [np.nan] * 5 + ["ALB"] * 2 + ["USA"])
    subregioncodes = list(
        ["US-AZ"] * 5 + ["US-MT"] * 4 + [np.nan] * 3 + ["UG-219"] * 2 + ["ZM-06"] * 3,
    )
    log_df = pd.DataFrame(
        {
            "id": range(17),
            "session_id": [0] * 5 + [1] * 4 + [2] * 1 + [3] * 2 + [4] * 3 + [5] * 2,
            "product_id": ["coke zero"] * 3
            + ["car"] * 2
            + ["toothpaste"] * 3
            + ["brown bag"] * 2
            + ["Haribo sugar-free gummy bears"]
            + ["coke zero"] * 4
            + ["taco clock"] * 2,
            "datetime": times,
            "value": values,
            "value_2": values_2,
            "latlong": latlong,
            "latlong2": latlong2,
            "zipcode": zipcodes,
            "countrycode": countrycodes,
            "subregioncode": subregioncodes,
            "value_many_nans": values_many_nans,
            "priority_level": [0] * 2 + [1] * 5 + [0] * 6 + [2] * 2 + [1] * 2,
            "purchased": [True] * 11 + [False] * 4 + [True, False],
            "url": ["https://www.featuretools.com/"] * 2
            + ["amazon.com"] * 2
            + [
                "www.featuretools.com",
                "bit.ly",
                "featuretools.com/demos/",
                "www.google.co.in/http://lplay.google.co.in",
                " ",
                "invalid_url",
                "an",
                "microsoft.com/search/",
            ]
            + [np.nan] * 5,
            "email_address": ["john.smith@example.com", np.nan, "team@featuretools.com"]
            * 5
            + [" prefix@space.com", "suffix@space.com "],
            "comments": [coke_zero_review()]
            + ["I loved it"] * 2
            + car_reviews()
            + toothpaste_reviews()
            + brown_bag_reviews()
            + [gummy_review()]
            + ["I loved it"] * 4
            + taco_clock_reviews(),
        },
    )

    return {
        "régions": region_df,
        "stores": store_df,
        "products": product_df,
        "customers": customer_df,
        "sessions": session_df,
        "log": log_df,
    }


def make_semantic_tags():
    store_semantic_tags = {"région_id": "foreign_key"}

    customer_semantic_tags = {"région_id": "foreign_key", "birthday": "date_of_birth"}

    session_semantic_tags = {"customer_id": "foreign_key"}

    log_semantic_tags = {"session_id": "foreign_key"}

    return {
        "customers": customer_semantic_tags,
        "sessions": session_semantic_tags,
        "log": log_semantic_tags,
        "products": {},
        "stores": store_semantic_tags,
        "régions": {},
    }


def make_logical_types(with_integer_time_index=False):
    region_logical_types = {"id": Categorical, "language": Categorical}

    store_logical_types = {
        "id": Integer,
        "région_id": Categorical,
        "num_square_feet": Double,
    }

    product_logical_types = {
        "id": Categorical,
        "rating": Double,
        "department": Categorical,
        "url": URL,
    }

    customer_logical_types = {
        "id": Integer,
        "age": Integer,
        "région_id": Categorical,
        "loves_ice_cream": Boolean,
        "favorite_quote": NaturalLanguage,
        "signup_date": Datetime(datetime_format="%Y-%m-%d"),
        "upgrade_date": Datetime(datetime_format="%Y-%m-%d"),
        "cancel_date": Datetime(datetime_format="%Y-%m-%d"),
        "cancel_reason": Categorical,
        "engagement_level": Ordinal(order=[1, 2, 3]),
        "full_name": PersonFullName,
        "email": EmailAddress,
        "phone_number": PhoneNumber,
        "birthday": Datetime(datetime_format="%Y-%m-%d"),
        "cohort_name": Categorical,
        "cohort": Integer,
    }

    session_logical_types = {
        "id": Integer,
        "customer_id": Integer,
        "device_type": Categorical,
        "device_name": Categorical,
        "ip": IPAddress,
        "filepath": Filepath,
    }

    log_logical_types = {
        "id": Integer,
        "session_id": Integer,
        "product_id": Categorical,
        "datetime": Datetime(datetime_format="%Y-%m-%d"),
        "value": Double,
        "value_2": Double,
        "latlong": LatLong,
        "latlong2": LatLong,
        "zipcode": PostalCode,
        "countrycode": CountryCode,
        "subregioncode": SubRegionCode,
        "value_many_nans": Double,
        "priority_level": Ordinal(order=[0, 1, 2]),
        "purchased": Boolean,
        "url": URL,
        "email_address": EmailAddress,
        "comments": NaturalLanguage,
    }
    if with_integer_time_index:
        log_logical_types["datetime"] = Integer
        customer_logical_types["signup_date"] = Integer
        customer_logical_types["upgrade_date"] = Integer
        customer_logical_types["cancel_date"] = Integer
        customer_logical_types["birthday"] = Integer

    return {
        "customers": customer_logical_types,
        "sessions": session_logical_types,
        "log": log_logical_types,
        "products": product_logical_types,
        "stores": store_logical_types,
        "régions": region_logical_types,
    }


def make_time_indexes(with_integer_time_index=False):
    return {
        "customers": {
            "name": "signup_date",
            "secondary": {"cancel_date": ["cancel_reason"]},
        },
        "log": {"name": "datetime", "secondary": None},
    }


def coke_zero_review():
    return """
When it comes to Coca-Cola products, people tend to be die-hard fans. Many of us know someone who can't go a day without a Diet Coke (or two or three). And while Diet Coke has been a leading sugar-free soft drink since it was first released in 1982, it came to light that young adult males shied away from this beverage — identifying diet cola as a woman's drink. The company's answer to that predicament came in 2005 - in the form of a shiny black can - with the release of Coca-Cola Zero.

While Diet Coke was created with its own flavor profile and not as a sugar-free version of the original, Coca-Cola Zero aims to taste just like the "real Coke flavor." Despite their polar opposite advertising campaigns, the contents and nutritional information of the two sugar-free colas is nearly identical. With that information in hand we at HuffPost Taste needed to know: Which of these two artificially-sweetened Coca-Cola beverages actually tastes better? And can you even tell the difference between them?

Before we get to the results of our taste test, here are the facts:


Diet Coke

Motto: Always Great Tast
Nutritional Information: Many say that a can of Diet Coke actually contains somewhere between 1-4 calories, but if a serving size contains fewer than 5 calories a company is not obligated to note it in its nutritional information. Diet Coke's nutritional information reads 0 Calories, 0g Fat, 40mg Sodium, 0g Total Carbs, 0g Protein.

Ingredients: Carbonated water, caramel color, aspartame, phosphoric acid, potassium benzonate, natural flavors, citric acid, caffeine.

Artificial sweetener: Aspartame


Coca-Cola Zero
Motto: Real Coca-Cola Taste AND Zero Calories

Nutritional Information: While the label clearly advertises this beverage as a zero calorie cola, we are not entirely certain that its minimal calorie content is simply not required to be noted in the nutritional information. Coca-Cola Zero's nutritional information reads 0 Calories, 0g Fat, 40mg Sodium, 0g Total Carbs, 0g Protein.

Artificial sweetener: Aspartame and acesulfame potassium

Ingredients: Carbonated water, caramel color, phosphoric acid, aspartame, potassium benzonate, natural flavors, potassium citrate, acesulfame potassium, caffeine.

The Verdict:
Twenty-four editors blind-tasted the two cokes, side by side, and...

54 percent of our tasters were able to distinguish Diet Coke from Coca-Cola Zero
50 percent of our tasters preferred Diet Coke to Coca-Cola Zero, and vice versa
Here’s what our tasters thought of the two sugar-free soft drinks:

Diet Coke: "Tastes fake right away." "Much fresher brighter, crisper." "Has the wonderful flavors of Diet Coke’s artificial sweeteners."

Coca-Cola Zero: "Has more of a sharply sweet aftertaste I associate with diet sodas." "Tastes more like regular coke, less like fake sweetener." "Has an odd taste." "Tastes more like regular." "Very sweet."

Overall comments: "That was a lot more difficult than I though it would be." "Both equally palatable." A few people said Diet Coke tasted much better ... unbeknownst to them, they were actually referring to Coca-Cola Zero.

IN SUMMARY: It is a real toss up. There is not one artificially-sweetened Coca-Cola beverage that outshines the other. So how do people choose between one or the other? It is either a matter of personal taste, or maybe the marketing campaigns will influence their choice.
"""


def gummy_review():
    return """
The place: BMO Harris Bradley Center
The event: Bucks VS Spurs
The snack: Satan's Diarrhea Hate Bears made by Haribo

I recently took my 4 year old son to his first NBA game. He was very excited to go to the game, and I was excited because we had fantastic seats. Row C center court to be exact. I've never sat that close before. I've never had to go DOWN stairs to get to my seats. 24 stairs to get to my seats to be exact.

His favorite candy is Skittles. Mine are anything gummy. I snuck in a bag of skittles for my son, and grabbed a handful of gummy bears for myself, to be later known as Satan's Diarrhea Hate Bears, that I received for Christmas in bulk from my parents, and put them in a zip lock bag.

After the excitement of the 1st quarter has ended I take my son out to get him a bottled water and myself a beer. We return to our seats to enjoy our candy and drinks.

..............fast forward until 1 minute before half time...........

I have begun to sweat a sweat that is only meant for a man on mile 19 of a marathon. I have kicked out my legs out so straight that I am violently pushing the gentleman wearing a suit seat in front of me forward. He is not happy, I do not care. My hands are on the side of my seat not unlike that of a gymnast on a pommel horse, lifting me off my chair. My son is oblivious to what is happening next to him, after all, there is a mascot running around somewhere and he is eating candy.

I realize that at some point in the very near to immediate future I am going to have to allow this lava from Satan to forcefully expel itself from my innards. I also realize that I have to walk up 24 stairs just to get to level ground in hopes to make it to the bathroom. I’ll just have to sit here stiff as a board for a few moments waiting for the pain to subside. About 30 seconds later there is a slight calm in the storm of the violent hurricane that is going on in my lower intestine. I muster the courage to gently relax every muscle in my lower half and stand up. My son stands up next to me and we start to ascend up the stairs. I take a very careful and calculated step up the first stair. Then a very loud horn sounds. Halftime. Great. It’s going to be crowded. The horn also seems to have awaken the Satan's Diarrhea Hate Bears that are having a mosh pit in my stomach. It literally felt like an avalanche went down my stomach and I again have to tighten every muscle and stand straight up and focus all my energy on my poor sphincter to tighten up and perform like it has never performed before. Taking another step would be the worst idea possible, the flood gates would open. Don’t worry, Daddy has a plan. I some how mumble the question, “want to play a game?” to my son, he of course says “yes”. My idea is to hop on both feet allllll the way up the stairs, using the center railing to propel me up each stair. My son is always up for a good hopping game, so he complies and joins in on the “fun”. Some old lady 4 steps up thinks its cute that we are doing this, obviously she wasn’t looking at the panic on my face. 3 rows behind her a man about the same age as me, who must have had similar situations, notices the fear/panic/desperation on my face understands the danger that I along with my pants and anyone within a 5 yard radius spray zone are in. He just mouths the words “good luck man” to me and I press on. 
Half way up and there is no leakage, but my legs are getting tired and my sphincter has never endured this amount of pressure for this long of time. 16 steps/hops later…….4 steps to go…….My son trips and falls on the stairs, I have two options: keep going knowing he will catch up or bend down to pick him up relieving my sphincter of all the pressure and commotion while ruining the day of roughly the 50 people that are now watching a grown man hop up stairs while sweating profusely next to a 4 year old boy.

Luckily he gets right back up and we make it to the top of the stairs. Good, the hard part was over. Or so I thought. I managed to waddle like a penguin, or someone who is about to poop their pants in 2.5 seconds, to the men's room only to find that every stall is being used. EVERY STALL. It's halftime, of course everyone has to poop at that moment. I don't know if I can wait any longer, do I go ahead and fulfil the dream of every high school boy and poop in the urinal? What kind of an example would that set for my son? On the other hand, what kind of an example would it be for his father to fill his pants with a substance that probably will be unrecognizable to man. Suddenly a stall door opens, and I think I manage to actually levitate over to the stall. I my son follows me in, luckily it was the handicap stall so there was room for him to be out of the way. I get my pants off and start to sit. I know what taking a giant poo feels like. I also know what vomiting feels like. I can now successfully say that I know what it is like to vomit out my butt. I wasn't pooping, those Satan's Diarrhea Hate Bears did something to my insides that made my sphincter vomit our the madness.

I am now conscious of my surroundings. Other than the war that the bottom half of my body is currently having with this porcelain chair, it is quiet as a pin drop in the bathroom. The other men in there can sense that something isn't right, no one has heard anyone ever poop vomit before.

I can sense that the worst part is over. But its not stopping, nor can I physically stop it at this point, I am leaking..it's horrible. I call out "does anyone have a diaper?" hoping that some gentleman was changing a baby. Nothing. No one said a word. I know people are in there, I can see the toes of shoes pointed in my direction under the stall.. "DOES ANYONE HAVE A DIAPER!?!" I am screaming, my son is now crying, he thinks he is witnessing the death of his father. I can't even assure him that I will make it.

Not a word was said, but a diaper was thrown over the stall. I catch it, line my underwear with it, put my pants back on, and walk out of that bathroom like a champ. We go straight to our seats, grab out coats and go home. As we are walking out, the gentleman that wished me good luck earlier simply put his fist out, and I happily bumped it.

My son asks me, "Daddy, why are we leaving early?"
"Well son, I need to change my diaper"
"""


def taco_clock_reviews():
    return [
        """
This timer does what it is supposed to do. Setup is elementary. Replacing the old one (after 12 years) was relatively easy. It has performed flawlessly since. I'm delighted I could find an esoteric product like this at Amazon. Their service, and the customer reviews, are just excellent.
""",
        """
Funny, cute clock. A little spendy for how light the clock is, but its hard to find a taco clock.
""",
    ]


def brown_bag_reviews():
    return [
        """
These bags looked exactly like I'd hoped, however, the handles broke off of almost every single bag as soon as items were placed in them! I used these as gift bags for out-of-town guests at my wedding, so imagine my embarassment as the handles broke off as I was handing them out. I would not recommend purchaing these bags unless you plan to fill them with nothing but paper! Anything heavier will cause the handles to snap right off.
""",
        """
I purchased these in August 2014 from Big Blue Supplies. I have no problem with the seller, these arrived new condition, fine shape.

I do have a slight problem with the bags. In case someone might want to know, the handles on these bags are set inside against the top. Then a piece of Kraft type packing tape is placed over the handles to hold them in place. On some of the bags, the tape is already starting to peel off. I would be really hesitant about using these bags unless I reinforced the current tape with a different adhesive.

I will keep the bags, and make a tape of a holiday or decorative theme and place over in order to make certain the handles stay in place.

Also in case anybody is wondering, the label on the plastic packaging bag states these are from ORIENTAL TRADING COMPANY. On the bottom of each bag it is stamped MADE IN CHINA. Again, I will be placing a sticker over that.

Even the dollar store bags I normally purchase do not have that stamped on the bottom in such prominent lettering. I purchased these because they were plain and I wanted to decorate them.

I do not think I would purchase again for all the reasons stated above.

Another thing for those still wanting to purchase, the ones I received were: 12 3/4 inches high not including handle, 10 1/4 inches wide and a 5 1/4 inch depth.
""",
    ]


def car_reviews():
    return [
        """
The full-size pickup truck and the V-8 engine were supposed to be inseparable, like the internet and cat videos. You can’t have one without the other—or so we thought.

In America’s most popular vehicle, the Ford F-150, two turbocharged six-cylinder engines marketed under the EcoBoost name have dethroned the naturally aspirated V-8. Ford’s new 2.7-liter twin-turbo V-6 is the popular choice, while the 3.5-liter twin-turbo V-6 is the top performer. The larger six allows for greater hauling capacity, accelerates the truck more quickly, and swills less gas in EPA testing than the V-8 alternative. It’s enough to make even old-school truck buyers acknowledge that there actually is a replacement for displacement.

And yet a V-8 in a big pickup truck still feels so natural, so right. In the F-150, the Coyote 5.0-liter V-8 is tuned for torque more so than power, yet it still revs with an enthusiastic giddy-up that reminds us that this engine’s other job is powering the Mustang. The response follows the throttle pedal faithfully while the six-speed automatic clicks through gears smoothly and easily. Together they pull this 5220-pound F-150 to 60 mph in 6.3 seconds, which is 0.4 second quicker than the 5.3-liter Chevrolet Silverado with the six-speed automatic and 0.9 second quicker than the 5.3 Silverado with the new eight-speed auto. The 3.5-liter EcoBoost, though, can do the deed another half-second quicker, but its synthetic soundtrack doesn’t have the rich, multilayered tone of the V-8.

It wasn’t until we saddled our test truck with a 6400-pound trailer (well under its 9000-pound rating) that we fully understood the case for upgrading to the 3.5-liter EcoBoost. The twin-turbo engine offers an extra 2500 pounds of towing capability and handles lighter tasks with considerably less strain. The 5.0-liter truck needs more revs and a wider throttle opening to accelerate its load, so we were often coaxed into pressing the throttle to the floor for even modest acceleration. The torquier EcoBoost engine offers a heartier response at part throttle.

In real-world, non-towing situations, the twin-turbo 3.5-liter doesn’t deliver on its promise of increased fuel economy, with both the 5.0-liter V-8 and that V-6 returning 16 mpg in our hands. But given the 3.5-liter’s virtues, we can forgive it that trespass.

Trucks Are the New Luxury

Pickups once were working-class transportation. Today, they’re proxy luxury vehicles—or at least that’s how they’re priced. If you think our test truck’s $57,240 window sticker is steep, consider that our model, the Lariat, is merely a mid-spec trim. There are three additional grades—King Ranch, Platinum, and Limited—positioned and priced above it, plus the 3.5-liter EcoBoost that costs an extra $400 as well as a plethora of options to inflate the price past 60 grand. Squint and you can almost see the six-figure trucks of the future on the horizon.

For the most part, though, the equipment in this particular Lariat lives up to the price tag. The driver and passenger seats are heated and cooled, with 10-way power adjustability and supple leather. The technology includes blind-spot monitoring, navigation, and a 110-volt AC outlet. Nods to utility include spotlights built into the side mirrors and Ford’s Pro Trailer Backup Assist, which makes reversing with a trailer as easy as turning a tiny knob on the dashboard.

Middle-Child Syndrome

In the F-150, Ford has a trifecta of engines (the fourth, a naturally aspirated 3.5-liter V-6, is best left to the fleet operators). The 2.7-liter twin-turbo V-6 delivers remarkable performance at an affordable price. The 3.5-liter twin-turbo V-6 is the workhorse, with power, torque, and hauling capability to spare. Compared with those two logical options, the middle-child 5.0-liter V-8 is the right-brain choice. Its strongest selling points may be its silky power delivery and the familiar V-8 rumble. That’s a flimsy argument when it comes to rationalizing a $50,000-plus purchase, though, so perhaps it’s no surprise that today’s boosted six-cylinders are now the engines of choice in the F-150.
""",
        """
THE GOOD
The Tesla Model S 90D's electric drivetrain is substantially more efficient than any internal combustion engine, and gives the car smooth and quick acceleration. All-wheel drive comes courtesy of a smart dual motor system. The new Autopilot feature eases the stress of stop-and-go traffic and long road trips.

THE BAD
Even at Tesla's Supercharger stations, recharging the battery takes significantly longer than refilling an internal combustion engine car's gas tank, limiting where you can drive. Tesla hasn't improved its infotainment system much from the Model S' launch.

THE BOTTOM LINE
Among the different flavors of Tesla Model S, the 90D is the one to get, exhibiting the best range and all-wheel drive, while offering an uncomplicated, next-generation driving experience that shows very well against equally priced competitors.


REVIEW  SPECIFICATIONS  PHOTOS
Roadshow Automobiles Tesla 2016 Tesla Model S
Having tested driver assistance systems in many cars, and even ridden in fully self-driving cars, I should have been ready for Tesla's new Autopilot feature. But engaging it while cruising the freeway in the Model S 90D, I kept my foot hovering over the brake.

My trepidation didn't come so much from the adaptive cruise control, which kept the Model S following traffic ahead at a set distance, but from the self-steering, this part of Autopilot managing to keep the Model S well-centered in its lane with no help from me. Over many miles, I built up more trust in the system, letting the car do the steering in situations from bumper-to-bumper traffic and a winding road through the hills.

2016 Tesla Model S 90DEnlarge Image
Although the middle of the Model S range, the 90D offers the best range and a wealth of useful tech, such as Autopilot self-driving.
Wayne Cunningham/Roadshow
Tesla added Autopilot to its Model S line as an option last year, along with all-wheel-drive. More recently, the high-tech automaker improved its batteries, upgrading its cars from their former 65 and 85 kilowatt-hour capacity to 70 and 90 kilowatt-hour. The example I drove, the 90D, represents all these advances.

More importantly, the 90D is the current range-leader among the Model S line, boasting 288 miles on a full battery charge.

The Model S' improvements fall outside of typical automotive industry product cycles, fulfilling Tesla's promise of acting more like a technology company, constantly building and deploying new features. Tesla accomplishes that goal partially through over-the-air software updates, improving existing cars, but the 90D presents significant hardware updates over the original Model S launched four years ago.

Sit and go
Of course, this Model S exhibited the ease of use of the original. Walking up to the car with the key fob in my pocket, it automatically unlocked. When I got in the car, it powered up without me having to push a start button, so I only needed to put it in drive to get on the road.

Likewise, the design hasn't changed, its sleek, hatchback four-door body offering excellent cargo room, both front and back, and seating space. The cabin feels less cramped than most cars due to the lack of a transmission tunnel and a dashboard bare of buttons or dials.

2016 Tesla Model S 90DEnlarge Image
The flat floor in the Model S' cabin makes for enhanced passenger room.
Wayne Cunningham/Roadshow
The big, 17-inch touchscreen in the center of the dashboard shows navigation, stereo, phone, energy consumption and car settings. I easily went from full-screen to a split-screen view, the windows showing each appearing instantly. A built-in 4G/LTE data connection powers Google maps and Internet-based audio. The LCD instrument panel in front of me showed my speed, energy usage, remaining range, and intelligently swapped audio information for turn-by-turn directions when started navigation.

The instrument panel actually made the experience of driving under Autopilot more comfortable, reassuring me with graphics that showed when the Model S' sensors were detecting the lane lines and the traffic around me. Impressively, the sensors could differentiate, as shown on the screen's graphics, a passenger car from a big truck.

At speed on the freeway, Autopilot smoothly maintained the car's position in its lane, and when I took my hands off the wheel for too long, it flashed a warning on the instrument panel. In stop-and-go traffic approaching a toll booth, the car did an even better job of self-driving, recognizing traffic around it and maintaining appropriate distances.

Handling surprise
Taking over the driving myself, the ride quality proved as comfortable as any sport-luxury car, as this Model S had its optional air suspension. The electric power steering is well-tuned, turning the wheels with a quiet, natural feel and good heft.

Audi S7 vs Tesla Model S
Shootout: Audi S7 vs. Tesla Model S
Wayne Cunningham/Roadshow
The biggest surprise came when I spent the day doing laps at the Thunderhill Raceway, negotiating a series of tight, technical turns in competition with an Audi S7. I expected the Model S to get out-of-shape in the turns, but instead it proved steady and solid. The Model S' 4,647-pound curb weight made it less than ideal for a track test, but much of that weight is in the battery pack, mounted low in the chassis. That low center of gravity helped limit body roll, ensuring good grip from all four tires. In the turns, the Model S felt nicely balanced, although not entirely nimble.

Helping its grip was its native all-wheel drive, gained from having motors driving each set of wheels. The combined output of the motors comes to 417 horsepower and 485 pound-feet of torque, those numbers expressed in 0-to-60 mph times of well under 5 seconds. That thrust made for fast runs down the race track's straightaways, or simply giving me the ability to take advantage of gaps in traffic on public roads.

288 miles is more than enough for most people's daily driving needs, and if you plug in every night, you will wake up to a fully charged car every morning. The Model S makes for a far different experience than driving an internal combustion car, where you need to go to a gas station to refuel. However, longer trips in the Model S require some planning, such as scheduling stops at Tesla's free Supercharger stations.


Charging times are much lengthier than refilling a tank with gasoline. From a Level 2, 240-volt station, you get 29 miles added every hour. Tesla's Supercharger, a Level 3 charger, takes 75 minutes to fully recharge the Model S 90D's battery.

2016 Tesla Model S 90DEnlarge Image
Despite its high initial price, the Model S 90D costs less to run on a daily basis than a combustion engine car.
Wayne Cunningham/Roadshow

Low maintenance
The 2016 Tesla Model S 90D adds features to keep it competitive against the internal combustion cars in its sport luxury set. More importantly, it remains very easy to live with. In fact, the electric drivetrain should mean greatly decreased maintenance, as there are fewer moving parts. The EPA estimates that annual electricity costs for the Model S 90D should run $650, much less than buying gasoline for an equivalent internal combustion car.

Lengthy charging times mean longer trips are either out of the question or require more planning than with an internal combustion car. And while the infotainment system responds quickly to touch inputs and offers useful screens, it hasn't changed much in four years. Most notably, Tesla hasn't added any music apps beyond the ones it launched with. Along with new, useful apps, it would be nice to have some themes or other aesthetic changes to the infotainment interface.

The Model S 90D's base price of $88,000 puts it out of reach of the average buyer, and the model I drove was optioned up to around $95,000. Against its Audi, BMW and Mercedes-Benz competition, however, it makes a compelling argument, especially for its uncomplicated nature.
""",
    ]


def toothpaste_reviews():
    return [
        """
Toothpaste can do more harm than good

The next time a patient innocently asks me, “What’s the best toothpaste to use?” I’m going to unleash a whole Chunky Soup can of “You Want The Truth? You CAN’T HANDLE THE TRUTH!!!” Gosh, that’s such an overused movie quote. Sorry about that, but still.

If you’re a dental professional, isn’t this the most annoying question you get, day after day? Do you even care which toothpaste your patients use?

No. You don’t. Asking a dentist what toothpaste to use is like asking your physician which bar of soap or body scrub you should use to clean your skin. Your dentist and dental hygienist have never seen a tube of toothpaste that singlehandedly improves the health of all patients in their practice, and the reason is simple:

Toothpaste is a cosmetic.

We brush our teeth so that out mouths no longer taste like… mouth. Mouth tastes gross, right? It tastes like putrefied skin. It tastes like tongue cheese. It tastes like Cream of Barf.

On the other hand, toothpaste has been exquisitely designed to bring you a brisk rush of York Peppermint Patty, or Triple Cinnamon Heaven, or whatever flavor that drives those tubes off of the shelves in the confusing dental aisle of your local supermarket or drugstore.


Toothpaste definitely tastes better than Cream of Barf. And that’s why you use it. Not because it’s good for you. You use toothpaste because it tastes good, and because it makes you accept your mouth as part of your face again.

From a marketing perspective, all of the other things that are in your toothpaste are in there to give it additional perceived value. So let’s deconstruct these ingredients, shall we?


1. Fluoride.

This was probably the first additive to toothpaste that brought it under the jurisdiction of the Food & Drug Administration and made toothpaste part drug, part cosmetic. Over time, a fluoride toothpaste can improve the strength of teeth, but the fluoride itself does nothing to make teeth cleaner. Some people are scared of fluoride so they don’t use it. Their choice. Professionally speaking, I know that the benefits of a fluoride additive far outweigh the risks.

2. Foam.

Sodium Lauryl Sulfate is soap. Soap has a creamy, thick texture that American tongues especially like and equate to the feeling of cleanliness. There’s not enough surfactant, though, in toothpaste foam to break up the goo that grows on your teeth. If these bubbles scrubbed, you’d better believe that they would also scrub your delicate gum tissues into a bloody pulp.

3. Abrasive particles.

Most toothpastes use hydrated silica as the grit that polishes teeth. You’re probably most familiar with it as the clear beady stuff in the “Do Not Eat” packets. Depending on the size and shape of the particles, silica is the whitening ingredient in most whitening toothpastes. But whitening toothpaste cannot get your teeth any whiter than a professional dental cleaning, because it only cleans the surface. Two weeks to a whiter smile? How about 30 minutes with your hygienist? It’s much more efficient and less harsh.

4. Desensitizers.

Teeth that are sensitive to hot, cold, sweets, or a combination can benefit from the addition of potassium nitrate or stannous fluoride to a toothpaste. This is more of a palliative treatment, when the pain is the problem. Good old Time will usually make teeth feel better, too, unless the pain is coming from a cavity. Yeah, I’m talking to you, the person who is trying to heal the hole in their tooth with Sensodyne.

5. Tartar control.

It burns! It burns! If your toothpaste has a particular biting flavor, it might contain tetrasodium pyrophosphate, an ingredient that is supposed to keep calcium phosphate salts (tartar, or calculus) from fossilizing on the back of your lower front teeth. A little tartar on your teeth doesn’t harm you unless it gets really thick and you can no longer keep it clean. One problem with tartar control toothpastes is that in order for the active ingredient to work, it has to be dissolved in a stronger detergent than usual, which can affect people that are sensitive to a high pH.

6. Triclosan.

This antimicrobial is supposed to reduce infections between the gum and tooth. However, if you just keep the germs off of your teeth in the first place it’s pretty much a waste of an extra ingredient. Its safety has been questioned but, like fluoride, the bulk of the scientific research easily demonstrates that the addition of triclosan in toothpaste does much more good than harm.

Why toothpaste can be bad for you.

Let’s just say it’s not the toothpaste’s fault. It’s yours. The toothpaste is just the co-dependent enabler. You’re the one with the problem.

Remember, toothpaste is a cosmetic, first and foremost. It doesn’t clean your teeth by itself. Just in case you think I’m making this up I’ve included clinical studies in the references at the end of this article that show how ineffective toothpaste really is.

peasized

• You’re using too much.

Don’t be so suggestible! Toothpaste ads show you how to use up the tube more quickly. Just use 1/3 as much, the size of a pea. It will still taste good, I promise! And too much foam can make you lose track of where your teeth actually are located.

• You’re not taking enough time.

At least two minutes. Any less and you’re missing spots. Just ’cause it tastes better doesn’t mean you did a good job.

• You’re not paying attention.

I’ve seen people brush the same four spots for two minutes and miss the other 60% of their mouth.brushguide The toothbrush needs to touch every crevice of every tooth, not just where it lands when you go into autopilot and start thinking about what you’re going to wear that day. It’s the toothbrush friction that cleans your teeth, not the cleaning product. Plaque is a growth, like the pink or grey mildew that grows around the edges of your shower. You’ve gotta rub it off to get it off. No tooth cleaning liquid, paste, creme, gel, or powder is going to make as much of a difference as your attention to detail will.

The solution.

Use what you like. It’s that simple. If it tastes good and feels clean to you, you’ll use it more often, brush longer, feel better, be healthier.

You can use baking soda, or coconut oil, or your favorite toothpaste, or even just plain water. The key is to have a good technique and to brush often. A music video makes this demonstration a little more fun than your usual lecture at the dental office, although, in my opinion you really still need to feel what it is like to MASH THE BRISTLES OF A SOFT TOOTHBRUSH INTO YOUR GUMS:


A little more serious video from my pal Dr. Mark Burhenne where he demonstrates how to be careful with your toothbrush bristles:


Final word.

♬ It’s all about that Bass, ’bout that Bass, no bubbles. ♬ Heh, dentistry in-joke there.

Seriously, though, the bottom line is that your paste will mask brushing technique issues, so don’t put so much faith in the power of toothpaste.

Also you may have heard that some toothpastes contain decorative plastic that can get swallowed. Yeah, that was a DentalBuzz report I wrote that went viral earlier this year. And while I can’t claim total victory on that front, at least the company in question has promised that the plastic will no longer be added to their toothpaste lines very soon due to the overwhelming amount of letters, emails, and phone calls that they received as a result of people reading that article and making a difference.

But now I’m tired of talking about toothpaste.

Next topic?

I’m bringing pyorrhea back.
    """,
        """
I’ve been a user of Colgate Total Whitening Toothpaste for many years because I’ve always tried to maintain a healthy smile (I’m a receptionist so I need a white smile). But because I drink coffee at least twice a day (sometimes more!) and a lot of herbal teas, I’ve found that using just this toothpaste alone doesn’t really get my teeth white...

The best way to get white teeth is to really try some professional products specifically for tooth whitening. I’ve tried a few products, like Crest White Strips and found that the strips are really not as good as the trays. Although the Crest White Strips are easy to use, they really DO NOT cover your teeth perfectly like some other professional dental whitening kits. This Product did cover my teeth well however because of their custom heat trays, and whitening my teeth A LOT. I would say if you really want white teeth, use the Colgate Toothpaste and least 2 times a day, along side a professional Gel product like Shine Whitening.
    """,
        """
The first feature is the price, and it is right.

Next, I consider whether it will be neat to use. It is. Sometimes when I buy those new hard plastic containers, they actually get messy. Also I cannot get all the toothpaste out. It is easy to get the paste out of Colgate Total Whitening Paste without spraying it all over the cabinet.

If it does not taste good, I won't use it. Some toothpaste burns my mouth so bad that brushing my teeth is a painful experience. This one doesn't burn. It tastes simply the way toothpaste is supposed to taste.

Whitening is important. This one is supposed ot whiten. After spending money to whiten my teeth, I need a product to help ward off the bad effects of coffee and tea.

Avoiding all kinds of oral pathology is a major consideration. This toothpaste claims that it can help fight cavities, gingivitis, plaque, tartar, and bad breath.

I hope this product stays on the market a long time and does not change.
    """,
    ]


================================================
FILE: featuretools/tests/utils_tests/__init__.py
================================================


================================================
FILE: featuretools/tests/utils_tests/test_config.py
================================================
import logging
import os

from featuretools.config_init import initialize_logging

logging_env_vars = {
    "FEATURETOOLS_LOG_LEVEL": "debug",
    "FEATURETOOLS_ES_LOG_LEVEL": "critical",
    "FEATURETOOLS_BACKEND_LOG_LEVEL": "error",
}


def test_logging_defaults():
    old_env_vars = {}
    for env_var in logging_env_vars:
        old_env_vars[env_var] = os.environ.get(env_var, None)
        if old_env_vars[env_var] is not None:
            del os.environ[env_var]

    initialize_logging()
    main_logger = logging.getLogger("featuretools")
    assert main_logger.getEffectiveLevel() == logging.INFO
    es_logger = logging.getLogger("featuretools.entityset")
    assert es_logger.getEffectiveLevel() == logging.INFO
    backend_logger = logging.getLogger("featuretools.computation_backend")
    assert backend_logger.getEffectiveLevel() == logging.INFO

    for env_var, value in old_env_vars.items():
        if value is not None:
            os.environ[env_var] = value


def test_logging_set_via_env():
    old_env_vars = {}
    for env_var, value in logging_env_vars.items():
        old_env_vars[env_var] = os.environ.get(env_var, None)
        os.environ[env_var] = value

    initialize_logging()
    main_logger = logging.getLogger("featuretools")
    assert main_logger.getEffectiveLevel() == logging.DEBUG
    es_logger = logging.getLogger("featuretools.entityset")
    assert es_logger.getEffectiveLevel() == logging.CRITICAL
    backend_logger = logging.getLogger("featuretools.computation_backend")
    assert backend_logger.getEffectiveLevel() == logging.ERROR

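    # Restore the environment: put back prior values and remove any variables
    # this test introduced, so they do not leak into other tests.
    for env_var, value in old_env_vars.items():
        if value is not None:
            os.environ[env_var] = value
        else:
            os.environ.pop(env_var, None)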


================================================
FILE: featuretools/tests/utils_tests/test_description_utils.py
================================================
from featuretools.utils.description_utils import convert_to_nth


def test_first():
    assert convert_to_nth(1) == "1st"
    assert convert_to_nth(21) == "21st"
    assert convert_to_nth(131) == "131st"


def test_second():
    assert convert_to_nth(2) == "2nd"
    assert convert_to_nth(22) == "22nd"
    assert convert_to_nth(232) == "232nd"


def test_third():
    assert convert_to_nth(3) == "3rd"
    assert convert_to_nth(23) == "23rd"
    assert convert_to_nth(133) == "133rd"


def test_nth():
    assert convert_to_nth(4) == "4th"
    assert convert_to_nth(11) == "11th"
    assert convert_to_nth(12) == "12th"
    assert convert_to_nth(13) == "13th"
    assert convert_to_nth(111) == "111th"
    assert convert_to_nth(112) == "112th"
    assert convert_to_nth(113) == "113th"


================================================
FILE: featuretools/tests/utils_tests/test_entry_point.py
================================================
import pandas as pd
import pytest

from featuretools import dfs


@pytest.fixture
def entry_points_dfs():
    cards_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
    transactions_df = pd.DataFrame(
        {
            "id": [1, 2, 3, 4, 5, 6],
            "card_id": [1, 2, 1, 3, 4, 5],
            "transaction_time": [10, 12, 13, 20, 21, 20],
            "fraud": [True, False, True, False, True, True],
        },
    )
    return cards_df, transactions_df


class MockEntryPoint(object):
    def on_call(self, kwargs):
        self.kwargs = kwargs

    def on_error(self, error, runtime):
        self.error = error

    def on_return(self, return_value, runtime):
        self.return_value = return_value

    def load(self):
        return self

    def __call__(self):
        return self


class MockPkgResources(object):
    def __init__(self, entry_point):
        self.entry_point = entry_point

    def iter_entry_points(self, name):
        return [self.entry_point]


def test_entry_point(es, monkeypatch):
    entry_point = MockEntryPoint()
    # overrides a module used in the entry_point decorator for dfs
    # so the decorator will use this mock entry point
    monkeypatch.setitem(
        dfs.__globals__["entry_point"].__globals__,
        "pkg_resources",
        MockPkgResources(entry_point),
    )
    fm, fl = dfs(entityset=es, target_dataframe_name="customers")
    assert "entityset" in entry_point.kwargs.keys()
    assert "target_dataframe_name" in entry_point.kwargs.keys()
    assert (fm, fl) == entry_point.return_value


def test_entry_point_error(es, monkeypatch):
    entry_point = MockEntryPoint()
    monkeypatch.setitem(
        dfs.__globals__["entry_point"].__globals__,
        "pkg_resources",
        MockPkgResources(entry_point),
    )
    with pytest.raises(KeyError):
        dfs(entityset=es, target_dataframe_name="missing_dataframe")

    assert isinstance(entry_point.error, KeyError)


def test_entry_point_detect_arg(monkeypatch, entry_points_dfs):
    # The fixture already builds both DataFrames; unpack them directly.
    cards_df, transactions_df = entry_points_dfs
    dataframes = {
        "cards": (cards_df, "id"),
        "transactions": (transactions_df, "id", "transaction_time"),
    }
    relationships = [("cards", "id", "transactions", "card_id")]
    entry_point = MockEntryPoint()
    monkeypatch.setitem(
        dfs.__globals__["entry_point"].__globals__,
        "pkg_resources",
        MockPkgResources(entry_point),
    )
    fm, fl = dfs(dataframes, relationships, target_dataframe_name="cards")
    assert "dataframes" in entry_point.kwargs.keys()
    assert "relationships" in entry_point.kwargs.keys()
    assert "target_dataframe_name" in entry_point.kwargs.keys()


================================================
FILE: featuretools/tests/utils_tests/test_gen_utils.py
================================================
import pandas as pd
import pytest
from woodwork import list_logical_types as ww_list_logical_types
from woodwork import list_semantic_tags as ww_list_semantic_tags

from featuretools import list_logical_types, list_semantic_tags
from featuretools.utils.gen_utils import (
    camel_and_title_to_snake,
    import_or_none,
    import_or_raise,
)


def test_import_or_raise_errors():
    with pytest.raises(ImportError, match="error message"):
        import_or_raise("_featuretools", "error message")


def test_import_or_raise_imports():
    math = import_or_raise("math", "error message")
    assert math.ceil(0.1) == 1


def test_import_or_none():
    math = import_or_none("math")
    assert math.ceil(0.1) == 1

    bad_lib = import_or_none("_featuretools")
    assert bad_lib is None


@pytest.fixture
def df():
    return pd.DataFrame({"id": range(5)})


def test_list_logical_types():
    ft_ltypes = list_logical_types()
    ww_ltypes = ww_list_logical_types()
    assert ft_ltypes.equals(ww_ltypes)


def test_list_semantic_tags():
    ft_semantic_tags = list_semantic_tags()
    ww_semantic_tags = ww_list_semantic_tags()
    assert ft_semantic_tags.equals(ww_semantic_tags)


def test_camel_and_title_to_snake():
    assert camel_and_title_to_snake("Top3Words") == "top_3_words"
    assert camel_and_title_to_snake("top3Words") == "top_3_words"
    assert camel_and_title_to_snake("Top100Words") == "top_100_words"
    assert camel_and_title_to_snake("top100Words") == "top_100_words"
    assert camel_and_title_to_snake("Top41") == "top_41"
    assert camel_and_title_to_snake("top41") == "top_41"
    assert camel_and_title_to_snake("41TopWords") == "41_top_words"
    assert camel_and_title_to_snake("TopThreeWords") == "top_three_words"
    assert camel_and_title_to_snake("topThreeWords") == "top_three_words"
    assert camel_and_title_to_snake("top_three_words") == "top_three_words"
    assert camel_and_title_to_snake("over_65") == "over_65"
    assert camel_and_title_to_snake("65_and_over") == "65_and_over"
    assert camel_and_title_to_snake("USDValue") == "usd_value"


================================================
FILE: featuretools/tests/utils_tests/test_recommend_primitives.py
================================================
import logging

import pandas as pd
import pytest
from woodwork.logical_types import NaturalLanguage
from woodwork.table_schema import ColumnSchema

from featuretools import EntitySet
from featuretools.primitives import Day, TransformPrimitive
from featuretools.utils.recommend_primitives import (
    DEFAULT_EXCLUDED_PRIMITIVES,
    TIME_SERIES_PRIMITIVES,
    _recommend_non_numeric_primitives,
    _recommend_skew_numeric_primitives,
    get_recommended_primitives,
)


@pytest.fixture
def moderate_right_skewed_df():
    return pd.DataFrame(
        {"moderately right skewed": [2, 3, 4, 4, 4, 5, 5, 7, 9, 11, 12, 13, 15]},
    )


@pytest.fixture
def heavy_right_skewed_df():
    return pd.DataFrame(
        {"heavy right skewed": [1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 9, 11, 13]},
    )


@pytest.fixture
def left_skewed_df():
    return pd.DataFrame(
        {"left skewed": [2, 3, 4, 5, 7, 9, 11, 11, 11, 12, 12, 12, 13, 15]},
    )


@pytest.fixture
def skewed_df_zeros():
    return pd.DataFrame({"zeros": [-1, 0, 0, 1, 2, 2, 3, 4, 5, 7, 9]})


@pytest.fixture
def normal_df():
    return pd.DataFrame({"normal": [2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11]})


@pytest.fixture
def right_skew_moderate_and_heavy_df(moderate_right_skewed_df, heavy_right_skewed_df):
    return pd.concat([moderate_right_skewed_df, heavy_right_skewed_df], axis=1)


@pytest.fixture
def es_with_skewed_dfs(
    moderate_right_skewed_df,
    heavy_right_skewed_df,
    left_skewed_df,
    skewed_df_zeros,
    normal_df,
    right_skew_moderate_and_heavy_df,
):
    es = EntitySet()
    es.add_dataframe(moderate_right_skewed_df, "moderate_right_skewed_df", "id")
    es.add_dataframe(heavy_right_skewed_df, "heavy_right_skewed_df", "id")
    es.add_dataframe(left_skewed_df, "left_skewed_df", "id")
    es.add_dataframe(skewed_df_zeros, "skewed_df_zeros", "id")
    es.add_dataframe(normal_df, "normal_df", "id")
    es.add_dataframe(
        right_skew_moderate_and_heavy_df,
        "right_skew_moderate_and_heavy_df",
        "id",
    )
    return es


def test_recommend_skew_numeric_primitives(es_with_skewed_dfs):
    valid_skew_primitives = {"square_root", "natural_logarithm"}
    valid_prims = [
        "cosine",
        "square_root",
        "natural_logarithm",
        "sine",
    ]
    assert _recommend_skew_numeric_primitives(
        es_with_skewed_dfs,
        "moderate_right_skewed_df",
        valid_prims,
    ) == {"square_root"}
    assert _recommend_skew_numeric_primitives(
        es_with_skewed_dfs,
        "heavy_right_skewed_df",
        valid_skew_primitives,
    ) == {"natural_logarithm"}
    assert (
        _recommend_skew_numeric_primitives(
            es_with_skewed_dfs,
            "left_skewed_df",
            valid_skew_primitives,
        )
        == set()
    )
    assert (
        _recommend_skew_numeric_primitives(
            es_with_skewed_dfs,
            "skewed_df_zeros",
            valid_skew_primitives,
        )
        == set()
    )
    assert (
        _recommend_skew_numeric_primitives(
            es_with_skewed_dfs,
            "normal_df",
            valid_skew_primitives,
        )
        == set()
    )
    assert (
        _recommend_skew_numeric_primitives(
            es_with_skewed_dfs,
            "right_skew_moderate_and_heavy_df",
            valid_skew_primitives,
        )
        == valid_skew_primitives
    )
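

# Illustrative sketch, not part of the test suite: one way to quantify how
# skewed the fixture columns above are. The helper name is hypothetical, and the
# adjusted Fisher-Pearson coefficient is an assumption chosen for illustration;
# the statistic and thresholds used by `_recommend_skew_numeric_primitives`
# are not shown in this file.
def _example_sample_skewness(values):
    n = len(values)
    mean = sum(values) / n
    # second and third central moments of the sample
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    g1 = m3 / m2**1.5
    # small-sample bias adjustment
    return g1 * (n * (n - 1)) ** 0.5 / (n - 2)


# With this measure the "heavy right skewed" fixture scores higher than the
# "moderately right skewed" one, and the "left skewed" fixture is negative.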


def test_recommend_non_numeric_primitives(make_es):
    ecom_es_customers = EntitySet()
    ecom_es_customers.add_dataframe(make_es["customers"])
    valid_primitives = [
        "day",
        "num_characters",
        "natural_logarithm",
        "sine",
    ]
    actual_recommendations = _recommend_non_numeric_primitives(
        ecom_es_customers,
        "customers",
        valid_primitives,
    )
    expected_recommendations = set(
        [
            "day",
            "num_characters",
        ],
    )
    assert expected_recommendations == actual_recommendations


def test_recommend_non_numeric_primitives_exception(make_es, caplog):
    class MockExceptionPrimitive(TransformPrimitive):
        """Mock primitive whose function always raises an exception."""

        name = "mock_primitive_with_exception"
        input_types = [ColumnSchema(logical_type=NaturalLanguage)]
        return_type = ColumnSchema(semantic_tags={"numeric"})

        def get_function(self):
            def make_exception(column):
                raise Exception("this primitive has an exception")

            return make_exception

    ecom_es_customers = EntitySet()
    ecom_es_customers.add_dataframe(make_es["customers"])
    valid_primitives = [MockExceptionPrimitive(), Day()]
    logger = logging.getLogger("featuretools")
    logger.propagate = True
    try:
        actual_recommendations = _recommend_non_numeric_primitives(
            ecom_es_customers,
            "customers",
            valid_primitives,
        )
    finally:
        # restore the logger even if the call above raises
        logger.propagate = False
    expected_recommendations = set(["day"])
    assert expected_recommendations == actual_recommendations
    assert (
        "Exception with feature MOCK_PRIMITIVE_WITH_EXCEPTION(favorite_quote) with primitive mock_primitive_with_exception: this primitive has an exception"
        in caplog.text
    )


def test_get_recommended_primitives_time_series(make_es):
    ecom_es_log = EntitySet()
    ecom_es_log.add_dataframe(make_es["log"])
    ecom_es_log["log"].ww.set_time_index("datetime")
    actual_recommendations_ts = get_recommended_primitives(
        ecom_es_log,
        True,
    )
    for ts_prim in TIME_SERIES_PRIMITIVES:
        assert ts_prim in actual_recommendations_ts


def test_get_recommended_primitives(make_es):
    ecom_es_customers = EntitySet()
    ecom_es_customers.add_dataframe(make_es["customers"])
    actual_recommendations = get_recommended_primitives(
        ecom_es_customers,
        False,
    )
    expected_recommendations = [
        "day",
        "num_characters",
        "natural_logarithm",
        "punctuation_count",
        "mean_characters_per_word",
        "is_weekend",
        "whitespace_count",
        "median_word_length",
        "month",
        "total_word_length",
        "weekday",
        "day_of_year",
        "week",
        "quarter",
        "email_address_to_domain",
        "number_of_common_words",
        "num_words",
        "num_unique_separators",
        "age",
        "year",
        "is_leap_year",
        "days_in_month",
        "is_free_email_domain",
        "number_of_unique_words",
    ]
    for prim in expected_recommendations:
        assert prim in actual_recommendations

    for ts_prim in TIME_SERIES_PRIMITIVES:
        assert ts_prim not in actual_recommendations


def test_get_recommended_primitives_exclude(make_es):
    ecom_es_customers = EntitySet()
    ecom_es_customers.add_dataframe(make_es["customers"])
    extra_exclude = ["num_characters", "natural_logarithm"]
    prims_to_exclude = DEFAULT_EXCLUDED_PRIMITIVES + extra_exclude
    actual_recommendations = get_recommended_primitives(
        ecom_es_customers,
        False,
        prims_to_exclude,
    )

    for ex_prim in extra_exclude:
        assert ex_prim not in actual_recommendations


def test_get_recommended_primitives_empty_es_error():
    error_msg = "No DataFrame in EntitySet found. Please add a DataFrame."
    empty_es = EntitySet()
    with pytest.raises(IndexError, match=error_msg):
        get_recommended_primitives(
            empty_es,
            False,
        )


def test_get_recommended_primitives_multi_table_es_error(make_es):
    error_msg = "Multi-table EntitySets are currently not supported. Please only use a single table EntitySet."
    with pytest.raises(IndexError, match=error_msg):
        get_recommended_primitives(
            make_es,
            False,
        )


================================================
FILE: featuretools/tests/utils_tests/test_time_utils.py
================================================
from datetime import datetime, timedelta
from itertools import chain

import numpy as np
import pandas as pd
import pytest

from featuretools.utils import convert_time_units, make_temporal_cutoffs
from featuretools.utils.time_utils import (
    calculate_trend,
    convert_datetime_to_floats,
    convert_timedelta_to_floats,
)


def test_make_temporal_cutoffs():
    instance_ids = pd.Series(range(10))
    cutoffs = pd.date_range(start="1/2/2015", periods=10, freq="1d")
    temporal_cutoffs_by_nwindows = make_temporal_cutoffs(
        instance_ids,
        cutoffs,
        window_size="1h",
        num_windows=2,
    )

    assert temporal_cutoffs_by_nwindows.shape[0] == 20
    # materialize as a list so it can be compared against in every loop below
    actual_instances = list(chain.from_iterable([[i, i] for i in range(10)]))
    actual_times = [
        "1/1/2015 23:00:00",
        "1/2/2015 00:00:00",
        "1/2/2015 23:00:00",
        "1/3/2015 00:00:00",
        "1/3/2015 23:00:00",
        "1/4/2015 00:00:00",
        "1/4/2015 23:00:00",
        "1/5/2015 00:00:00",
        "1/5/2015 23:00:00",
        "1/6/2015 00:00:00",
        "1/6/2015 23:00:00",
        "1/7/2015 00:00:00",
        "1/7/2015 23:00:00",
        "1/8/2015 00:00:00",
        "1/8/2015 23:00:00",
        "1/9/2015 00:00:00",
        "1/9/2015 23:00:00",
        "1/10/2015 00:00:00",
        "1/10/2015 23:00:00",
        "1/11/2015 00:00:00",
    ]
    actual_times = [pd.Timestamp(c) for c in actual_times]

    for computed, actual in zip(
        temporal_cutoffs_by_nwindows["instance_id"],
        actual_instances,
    ):
        assert computed == actual
    for computed, actual in zip(temporal_cutoffs_by_nwindows["time"], actual_times):
        assert computed == actual

    cutoffs = [pd.Timestamp("1/2/2015")] * 9 + [pd.Timestamp("1/3/2015")]
    starts = [pd.Timestamp("1/1/2015")] * 9 + [pd.Timestamp("1/2/2015")]
    actual_times = ["1/1/2015 00:00:00", "1/2/2015 00:00:00"] * 9
    actual_times += ["1/2/2015 00:00:00", "1/3/2015 00:00:00"]
    actual_times = [pd.Timestamp(c) for c in actual_times]
    temporal_cutoffs_by_wsz_start = make_temporal_cutoffs(
        instance_ids,
        cutoffs,
        window_size="1d",
        start=starts,
    )

    for computed, actual in zip(
        temporal_cutoffs_by_wsz_start["instance_id"],
        actual_instances,
    ):
        assert computed == actual
    for computed, actual in zip(temporal_cutoffs_by_wsz_start["time"], actual_times):
        assert computed == actual

    cutoffs = [pd.Timestamp("1/2/2015")] * 9 + [pd.Timestamp("1/3/2015")]
    starts = [pd.Timestamp("1/1/2015")] * 10
    actual_times = ["1/1/2015 00:00:00", "1/2/2015 00:00:00"] * 9
    actual_times += ["1/1/2015 00:00:00", "1/3/2015 00:00:00"]
    actual_times = [pd.Timestamp(c) for c in actual_times]
    temporal_cutoffs_by_nw_start = make_temporal_cutoffs(
        instance_ids,
        cutoffs,
        num_windows=2,
        start=starts,
    )

    for computed, actual in zip(
        temporal_cutoffs_by_nw_start["instance_id"],
        actual_instances,
    ):
        assert computed == actual
    for computed, actual in zip(temporal_cutoffs_by_nw_start["time"], actual_times):
        assert computed == actual


def test_convert_time_units():
    units = {
        "years": 31540000,
        "months": 2628000,
        "days": 86400,
        "hours": 3600,
        "minutes": 60,
        "seconds": 1,
        "milliseconds": 0.001,
        "nanoseconds": 0.000000001,
    }
    for each in units:
        assert convert_time_units(units[each] * 2, each) == 2
        assert np.isclose(convert_time_units(float(units[each] * 2), each), 2)

    error_text = "Invalid unit given, make sure it is plural"
    with pytest.raises(ValueError, match=error_text):
        convert_time_units(10, "jnkwjgn")
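

# Illustrative sketch of the conversion the test above exercises, assuming it is
# a plain division by a seconds-per-unit table (values copied from the test).
# `_example_convert_time_units` is a hypothetical stand-in, not the featuretools
# implementation.
_EXAMPLE_SECONDS_PER_UNIT = {
    "years": 31540000,
    "months": 2628000,
    "days": 86400,
    "hours": 3600,
    "minutes": 60,
    "seconds": 1,
    "milliseconds": 0.001,
    "nanoseconds": 0.000000001,
}


def _example_convert_time_units(secs, unit):
    if unit not in _EXAMPLE_SECONDS_PER_UNIT:
        raise ValueError("Invalid unit given, make sure it is plural")
    return secs / _EXAMPLE_SECONDS_PER_UNIT[unit]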


@pytest.mark.parametrize(
    "dt, expected_floats",
    [
        (
            pd.Series(
                [
                    datetime(2010, 1, 1, 11, 45, 0),
                    datetime(2010, 1, 1, 12, 55, 15),
                    datetime(2010, 1, 1, 11, 57, 30),
                    datetime(2010, 1, 1, 11, 12),
                    datetime(2010, 1, 1, 11, 12, 15),
                ],
            ),
            pd.Series([21039105.0, 21039175.25, 21039117.5, 21039072.0, 21039072.25]),
        ),
        (
            pd.Series(
                list(pd.date_range(start="2017-01-01", freq="1d", periods=3))
                + list(pd.date_range(start="2017-01-10", freq="2d", periods=4))
                + list(pd.date_range(start="2017-01-22", freq="1d", periods=7)),
            ),
            pd.Series(
                [
                    17167.0,
                    17168.0,
                    17169.0,
                    17176.0,
                    17178.0,
                    17180.0,
                    17182.0,
                    17188.0,
                    17189.0,
                    17190.0,
                    17191.0,
                    17192.0,
                    17193.0,
                    17194.0,
                ],
            ),
        ),
    ],
)
def test_convert_datetime_floats(dt, expected_floats):
    actual_floats = convert_datetime_to_floats(dt)
    pd.testing.assert_series_equal(pd.Series(actual_floats), expected_floats)


@pytest.mark.parametrize(
    "td, expected_floats",
    [
        (
            pd.Series(
                [
                    pd.Timedelta(2, "day"),
                    pd.Timedelta(120000000),
                    pd.Timedelta(48, "sec"),
                    pd.Timedelta(30, "min"),
                    pd.Timedelta(12, "hour"),
                ],
            ),
            pd.Series(
                [
                    2.0,
                    1.388888888888889e-06,
                    0.0005555555555555556,
                    0.020833333333333332,
                    0.5,
                ],
            ),
        ),
        (
            pd.Series(
                [
                    timedelta(days=4),
                    timedelta(milliseconds=4000000),
                    timedelta(hours=2, seconds=49),
                ],
            ),
            pd.Series([4.0, 0.0462962962962963, 0.08390046296296297]),
        ),
    ],
)
def test_convert_timedelta_to_floats(td, expected_floats):
    actual_floats = convert_timedelta_to_floats(td)
    pd.testing.assert_series_equal(pd.Series(actual_floats), expected_floats)


@pytest.mark.parametrize(
    "series,expected_trends",
    [
        (
            # using datetimes
            pd.Series(
                data=[0, 5, 10],
                index=[
                    datetime(2019, 1, 1),
                    datetime(2019, 1, 2),
                    datetime(2019, 1, 3),
                ],
            ),
            5.0,
        ),
        (
            # using pd.Timestamp
            pd.Series(
                data=[0, -5, 3],
                index=pd.date_range(start="2019-01-01", freq="1D", periods=3),
            ),
            1.4999999999999998,
        ),
        (
            pd.Series(
                data=[1, 2, 4, 8, 16],
                index=pd.date_range(start="2019-01-01", freq="1D", periods=5),
            ),
            3.6000000000000005,
        ),
        (
            # using pd.Timedelta with no change in time
            pd.Series(
                data=[1, 2, 3],
                index=[
                    pd.Timedelta(120000000),
                    pd.Timedelta(120000000),
                    pd.Timedelta(120000000),
                ],
            ),
            0,
        ),
    ],
)
def test_calculate_trend(series, expected_trends):
    actual_trends = calculate_trend(series)
    assert np.isclose(actual_trends, expected_trends)
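

# Illustrative sketch: the expected values above are consistent with an ordinary
# least-squares slope expressed per day, with 0 returned when the time index
# does not vary. The helper below is hypothetical and takes day offsets
# directly; it is not the featuretools implementation of `calculate_trend`.
def _example_trend_per_day(values, day_offsets):
    n = len(values)
    x_mean = sum(day_offsets) / n
    y_mean = sum(values) / n
    x_var = sum((x - x_mean) ** 2 for x in day_offsets)
    if x_var == 0:
        # no change in time, so no slope can be fitted
        return 0
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(day_offsets, values))
    return cov / x_var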


================================================
FILE: featuretools/tests/utils_tests/test_trie.py
================================================
from featuretools.utils import Trie


def test_get_node():
    t = Trie(default=lambda: "default")

    t.get_node([1, 2, 3]).value = "123"
    t.get_node([1, 2, 4]).value = "124"
    sub = t.get_node([1, 2])
    assert sub.get_node([3]).value == "123"
    assert sub.get_node([4]).value == "124"

    sub.get_node([4, 5]).value = "1245"
    assert t.get_node([1, 2, 4, 5]).value == "1245"


def test_setting_and_getting():
    t = Trie(default=lambda: "default")
    assert t.get_node([1, 2, 3]).value == "default"

    t.get_node([1, 2, 3]).value = "123"
    t.get_node([1, 2, 4]).value = "124"
    assert t.get_node([1, 2, 3]).value == "123"
    assert t.get_node([1, 2, 4]).value == "124"

    assert t.get_node([1]).value == "default"
    t.get_node([1]).value = "1"
    assert t.get_node([1]).value == "1"

    t.get_node([1, 2, 3]).value = "updated"
    assert t.get_node([1, 2, 3]).value == "updated"


def test_iteration():
    t = Trie(default=lambda: "default", path_constructor=tuple)

    t.get_node((1, 2, 3)).value = "123"
    t.get_node((1, 2, 4)).value = "124"
    expected = [
        ((), "default"),
        ((1,), "default"),
        ((1, 2), "default"),
        ((1, 2, 3), "123"),
        ((1, 2, 4), "124"),
    ]

    for i, value in enumerate(t):
        assert value == expected[i]


================================================
FILE: featuretools/tests/utils_tests/test_utils_info.py
================================================
import os

import pytest

from featuretools import __version__
from featuretools.utils import (
    get_featuretools_root,
    get_installed_packages,
    get_sys_info,
    show_info,
)


@pytest.fixture
def this_dir():
    return os.path.dirname(os.path.abspath(__file__))


def test_show_info(capsys):
    show_info()
    captured = capsys.readouterr()
    assert "Featuretools version" in captured.out
    assert "Featuretools installation directory:" in captured.out
    assert __version__ in captured.out
    assert "SYSTEM INFO" in captured.out


def test_sys_info():
    sys_info = get_sys_info()
    info_keys = [
        "python",
        "python-bits",
        "OS",
        "OS-release",
        "machine",
        "processor",
        "byteorder",
        "LC_ALL",
        "LANG",
        "LOCALE",
    ]
    found_keys = [k for k, _ in sys_info]
    assert set(info_keys).issubset(found_keys)


def test_installed_packages():
    installed_packages = get_installed_packages()
    # Per PEP 503, package names are compared case-insensitively and
    # underscores and hyphens are treated as equivalent
    installed_set = {
        name.lower().replace("-", "_") for name in installed_packages.keys()
    }
    requirements = [
        "pandas",
        "numpy",
        "tqdm",
        "cloudpickle",
        "psutil",
    ]
    assert set(requirements).issubset(installed_set)


def test_get_featuretools_root(this_dir):
    root = os.path.abspath(os.path.join(this_dir, "..", ".."))
    assert get_featuretools_root() == root


================================================
FILE: featuretools/utils/__init__.py
================================================
# flake8: noqa
from featuretools.utils.api import *


================================================
FILE: featuretools/utils/api.py
================================================
# flake8: noqa
from featuretools.utils.entry_point import entry_point
from featuretools.utils.gen_utils import make_tqdm_iterator
from featuretools.utils.time_utils import (
    calculate_trend,
    convert_time_units,
    make_temporal_cutoffs,
)
from featuretools.utils.trie import Trie
from featuretools.utils.utils_info import (
    get_featuretools_root,
    get_installed_packages,
    get_sys_info,
    show_info,
)


================================================
FILE: featuretools/utils/common_tld_utils.py
================================================
# put longer TLDs first to avoid catching a small part of a longer TLD and escape periods
COMMON_TLDS = [
    "management",
    "technology",
    "solutions",
    "delivery",
    "services",
    "software",
    "digital",
    "finance",
    "monster",
    "network",
    "support",
    "systems",
    "website",
    "agency",
    "design",
    "events",
    "global",
    "health",
    "online",
    "stream",
    "studio",
    "travel",
    "apple",
    "click",
    "cloud",
    "email",
    "games",
    "group",
    "media",
    "ninja",
    "press",
    "rocks",
    "space",
    "store",
    "today",
    "tools",
    "video",
    "works",
    "world",
    "aero",
    "arpa",
    "asia",
    "bank",
    "best",
    "blog",
    "buzz",
    "care",
    "casa",
    "chat",
    "club",
    "coop",
    "cyou",
    "desi",
    "farm",
    "goog",
    "guru",
    "host",
    "info",
    "jobs",
    "life",
    "link",
    "live",
    "mobi",
    "name",
    "news",
    "page",
    "plus",
    "shop",
    "site",
    "team",
    "tech",
    "work",
    "zone",
    "app",
    "aws",
    "bid",
    "biz",
    "box",
    "cam",
    "cat",
    "com",
    "dev",
    "edu",
    "eus",
    "fun",
    "gov",
    "icu",
    "int",
    "ltd",
    "mil",
    "net",
    "nyc",
    "one",
    "onl",
    "org",
    "ovh",
    "pro",
    "pub",
    "run",
    "sap",
    "top",
    "vip",
    "win",
    "xxx",
    "xyz",
    "ac",
    "ad",
    "ae",
    "ag",
    "ai",
    "al",
    "am",
    "ar",
    "at",
    "au",
    "az",
    "ba",
    "bd",
    "be",
    "bg",
    "br",
    "by",
    "bz",
    "ca",
    "cc",
    "cf",
    "ch",
    "cl",
    "cm",
    "cn",
    "co",
    "cr",
    "cu",
    "cx",
    "cy",
    "cz",
    "de",
    "dk",
    "do",
    "ec",
    "ee",
    "eg",
    "es",
    "eu",
    "fi",
    "fm",
    "fr",
    "ga",
    "ge",
    "gg",
    "gl",
    "gq",
    "gr",
    "gs",
    "gt",
    "hk",
    "hn",
    "hr",
    "hu",
    "id",
    "ie",
    "il",
    "im",
    "in",
    "io",
    "ir",
    "is",
    "it",
    "jo",
    "jp",
    "ke",
    "kh",
    "ki",
    "kr",
    "kw",
    "kz",
    "la",
    "lb",
    "li",
    "lk",
    "lt",
    "lu",
    "lv",
    "ly",
    "ma",
    "md",
    "me",
    "mk",
    "ml",
    "mm",
    "mn",
    "ms",
    "mu",
    "mx",
    "my",
    "nf",
    "ng",
    "nl",
    "no",
    "np",
    "nu",
    "nz",
    "om",
    "pa",
    "pe",
    "ph",
    "pk",
    "pl",
    "pr",
    "ps",
    "pt",
    "pw",
    "py",
    "qa",
    "re",
    "ro",
    "rs",
    "ru",
    "sa",
    "sc",
    "se",
    "sg",
    "sh",
    "si",
    "sk",
    "so",
    "st",
    "su",
    "sv",
    "sx",
    "th",
    "tj",
    "tk",
    "tn",
    "to",
    "tr",
    "tt",
    "tv",
    "tw",
    "ua",
    "ug",
    "uk",
    "us",
    "uy",
    "vc",
    "ve",
    "vn",
    "ws",
    "za",
]
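

# Illustrative sketch of why the ordering above matters, assuming the list is
# joined into a regex alternation (that usage is not shown in this file).
# Python tries alternatives left to right, so a short TLD listed before a longer
# one sharing its prefix would match first. The helper name is hypothetical.
import re


def _example_tld_pattern(tlds):
    # escape each TLD and match it after a literal period
    return re.compile(r"\.(?:" + "|".join(re.escape(t) for t in tlds) + ")")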


================================================
FILE: featuretools/utils/description_utils.py
================================================
def convert_to_nth(integer):
    """Return the integer as an English ordinal string, e.g. 1 -> "1st", 11 -> "11th"."""
    string_nth = str(integer)
    end_int = integer % 10
    # 11, 12 and 13 take "th" even though they end in 1, 2 and 3
    if end_int == 1 and integer % 100 != 11:
        return string_nth + "st"
    elif end_int == 2 and integer % 100 != 12:
        return string_nth + "nd"
    elif end_int == 3 and integer % 100 != 13:
        return string_nth + "rd"
    else:
        return string_nth + "th"


================================================
FILE: featuretools/utils/entry_point.py
================================================
import time
from functools import wraps
from inspect import signature

import pkg_resources


def entry_point(name):
    def inner_function(func):
        @wraps(func)
        def function_wrapper(*args, **kwargs):
            """Call ``func`` and notify every plugin registered under ``name``."""
            # add positional args as named kwargs
            on_call_kwargs = kwargs.copy()
            sig = signature(func)
            for arg, parameter in zip(args, sig.parameters):
                on_call_kwargs[parameter] = arg

            # collect and initialize all registered entry points
            entry_points = []
            for entry_point in pkg_resources.iter_entry_points(name):
                entry_point = entry_point.load()
                entry_points.append(entry_point())

            # send arguments before function is called
            for ep in entry_points:
                ep.on_call(on_call_kwargs)

            # call function, timing it even when it raises
            start = time.time()
            try:
                return_value = func(*args, **kwargs)
                runtime = time.time() - start
            except Exception as e:
                runtime = time.time() - start
                # send error
                for ep in entry_points:
                    ep.on_error(error=e, runtime=runtime)
                raise

            # send return value
            for ep in entry_points:
                ep.on_return(return_value=return_value, runtime=runtime)

            return return_value

        return function_wrapper

    return inner_function
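

# Illustrative sketch of the hook protocol used above (`on_call`, `on_error`,
# `on_return`). To stay self-contained it takes the plugin objects as a list
# instead of discovering them through `pkg_resources`; both names below are
# hypothetical and are not part of the featuretools API.
import time
from functools import wraps
from inspect import signature


class _ExampleRecorder:
    """Hypothetical plugin that records every hook call it receives."""

    def __init__(self):
        self.events = []

    def on_call(self, kwargs):
        self.events.append(("call", dict(kwargs)))

    def on_error(self, error, runtime):
        self.events.append(("error", type(error).__name__))

    def on_return(self, return_value, runtime):
        self.events.append(("return", return_value))


def _example_with_hooks(hooks):
    def inner_function(func):
        @wraps(func)
        def function_wrapper(*args, **kwargs):
            # map positional args onto parameter names, as entry_point does
            on_call_kwargs = kwargs.copy()
            for arg, parameter in zip(args, signature(func).parameters):
                on_call_kwargs[parameter] = arg
            for hook in hooks:
                hook.on_call(on_call_kwargs)

            start = time.time()
            try:
                return_value = func(*args, **kwargs)
            except Exception as e:
                for hook in hooks:
                    hook.on_error(error=e, runtime=time.time() - start)
                raise
            runtime = time.time() - start
            for hook in hooks:
                hook.on_return(return_value=return_value, runtime=runtime)
            return return_value

        return function_wrapper

    return inner_function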


================================================
FILE: featuretools/utils/gen_utils.py
================================================
import importlib
import logging
import re
import sys

from tqdm import tqdm

logger = logging.getLogger("featuretools.utils")


def make_tqdm_iterator(**kwargs):
    options = {"file": sys.stdout, "leave": True}
    options.update(kwargs)
    return tqdm(**options)


def get_relationship_column_id(path):
    _, r = path[0]
    child_link_name = r._child_column_name
    for _, r in path[1:]:
        parent_link_name = child_link_name
        child_link_name = "%s.%s" % (r.parent_name, parent_link_name)
    return child_link_name


def find_descendents(cls):
    """
    A generator which yields all descendent classes of the given class
    (including the given class)

    Args:
        cls (Class): the class to find descendents of
    """
    yield cls
    for sub in cls.__subclasses__():
        yield from find_descendents(sub)


def import_or_raise(library, error_msg):
    """
    Attempts to import the requested library. If the import fails, raises an
    ImportError with the supplied error message.

    Args:
        library (str): the name of the library
        error_msg (str): error message to return if the import fails
    """
    try:
        return importlib.import_module(library)
    except ImportError:
        raise ImportError(error_msg)


def import_or_none(library):
    """
    Attempts to import the requested library.

    Args:
        library (str): the name of the library

    Returns:
        the library if it is installed, else None
    """
    try:
        return importlib.import_module(library)
    except ImportError:
        return None


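def camel_and_title_to_snake(name):
    """Convert a camelCase or TitleCase name to snake_case.

    Digit runs are split off as their own word, e.g. "Top41" -> "top_41".
    """
    # separate a run of digits from the non-digit text preceding it
    name = re.sub(r"([^_\d]+)(\d+)", r"\1_\2", name)
    # separate a capitalized word from whatever character precedes it
    name = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
    # separate a lowercase letter or digit from a following capital
    return re.sub("([a-z0-9])([A-Z])", r"\1_\2", name).lower()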
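

# Illustrative sketch: the same three substitutions as above, returning the
# string after each pass so the effect of every regex can be inspected. The
# helper name is hypothetical and exists only for this walkthrough.
import re


def _example_trace_camel_and_title_to_snake(name):
    after_digits = re.sub(r"([^_\d]+)(\d+)", r"\1_\2", name)
    after_title = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", after_digits)
    after_camel = re.sub("([a-z0-9])([A-Z])", r"\1_\2", after_title)
    return [after_digits, after_title, after_camel, after_camel.lower()]


# For "41TopWords" the passes give "41TopWords", "41_TopWords", "41_Top_Words"
# and finally "41_top_words".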


================================================
FILE: featuretools/utils/plot_utils.py
================================================
from featuretools.utils.gen_utils import import_or_raise


def check_graphviz():
    GRAPHVIZ_ERR_MSG = (
        "Please install graphviz to plot."
        + " (See https://featuretools.alteryx.com/en/stable/install.html#installing-graphviz for"
        + " details)"
    )
    graphviz = import_or_raise("graphviz", GRAPHVIZ_ERR_MSG)
    # Try rendering a dummy graph to see if a working backend is installed
    try:
        graphviz.Digraph().pipe(format="svg")
    except graphviz.backend.ExecutableNotFound:
        raise RuntimeError(
            "To plot entity sets, a graphviz backend is required.\n"
            + "Install the backend using one of the following commands:\n"
            + "  Mac OS: brew install graphviz\n"
            + "  Linux (Ubuntu): $ sudo apt install graphviz\n"
            + "  Windows (conda): conda install -c conda-forge python-graphviz\n"
            + "  Windows (pip): pip install graphviz\n"
            + "  Windows (EXE required if graphviz was installed via pip): https://graphviz.org/download/#windows\n"
            + "  For more details visit: https://featuretools.alteryx.com/en/stable/install.html#installing-graphviz",
        )
    return graphviz


def get_graphviz_format(graphviz, to_file):
    if to_file:
        # Explicitly cast to str in case a Path object was passed in
        to_file = str(to_file)

        split_path = to_file.split(".")
        if len(split_path) < 2:
            raise ValueError(
                "Please use a file extension like '.pdf'"
                + " so that the format can be inferred",
            )

        format_ = split_path[-1]
        valid_formats = graphviz.FORMATS
        if format_ not in valid_formats:
            raise ValueError(
                "Unknown format. Make sure your format is"
                + " amongst the following: %s" % valid_formats,
            )
    else:
        format_ = None
    return format_


def save_graph(graph, to_file, format_):
    # Explicitly cast to str in case a Path object was passed in
    to_file = str(to_file)
    # Graphviz always appends the format to the file name, so we need to
    # remove it manually to avoid file names like 'file_name.pdf.pdf'
    offset = len(format_) + 1  # Add 1 for the dot
    output_path = to_file[:-offset]
    graph.render(output_path, cleanup=True)
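

# Illustrative sketch of the suffix handling in `save_graph`, isolated from
# graphviz so it can be run on its own. The helper name is hypothetical.
def _example_strip_format_suffix(to_file, format_):
    offset = len(format_) + 1  # the extension plus the dot before it
    return str(to_file)[:-offset]


# "plots/entityset.pdf" with format "pdf" becomes "plots/entityset", to which
# graphviz then appends ".pdf" again when rendering.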


================================================
FILE: featuretools/utils/recommend_primitives.py
================================================
import logging
from typing import List

from featuretools.computational_backends import calculate_feature_matrix
from featuretools.entityset import EntitySet
from featuretools.primitives.utils import get_transform_primitives
from featuretools.synthesis import dfs, get_valid_primitives

ORDERED_PRIMITIVES = [  # non-numeric primitives that require specific ordering or a time index to be set
    "cum_count",
    "cumulative_time_since_last_false",
    "cumulative_time_since_last_true",
    "diff",
    "diff_datetime",
    "is_first_occurrence",
    "is_last_occurrence",
    "time_since_previous",
]


DEPRECATED_PRIMITIVES = [
    "multiply_boolean",  # functionality duplicated by 'and' primitive
    "numeric_lag",  # deprecated and replaced with `lag`
]

REQUIRED_INPUT_PRIMITIVES = [  # non-numeric primitives that require input
    "count_string",
    "distance_to_holiday",
    "is_in_geobox",
    "not_equal_scalar",
    "equal_scalar",
    "time_since",
    "isin",
]

OTHER_PRIMITIVES_TO_EXCLUDE = [  # Excluding some primitives that can produce too many features or aren't useful in extracting information
    "not",
    "and",
    "or",
    "equal",
    "not_equal",
]

DEFAULT_EXCLUDED_PRIMITIVES = (
    REQUIRED_INPUT_PRIMITIVES
    + DEPRECATED_PRIMITIVES
    + ORDERED_PRIMITIVES
    + OTHER_PRIMITIVES_TO_EXCLUDE
)

# TODO: Make this list more dynamic
TIME_SERIES_PRIMITIVES = [
    "expanding_count",
    "expanding_max",
    "expanding_mean",
    "expanding_min",
    "expanding_std",
    "expanding_trend",
    "lag",
    "rolling_count",
    "rolling_outlier_count",
    "rolling_max",
    "rolling_mean",
    "rolling_min",
    "rolling_std",
    "rolling_trend",
]


# TODO: Support multi-table
def get_recommended_primitives(
    entityset: EntitySet,
    include_time_series_primitives: bool = False,
    excluded_primitives: List[str] = DEFAULT_EXCLUDED_PRIMITIVES,
) -> List[str]:
    """Get a list of recommended primitives given an entity set.

    Description:
        This function first gets the list of valid transform primitives that can be applied to a single-table EntitySet, withholding any primitives specified in `excluded_primitives`.
        Next, engineered features are created for non-numeric fields and checked for non-uniqueness. If a feature is non-unique, its primitive is added to the recommendation list.
        Then, numeric fields are checked for skewness. Depending on how skewed a column is, `square_root` or `natural_logarithm` will be recommended.
        Lastly, if `include_time_series_primitives` is `True`, `Lag` will always be recommended,
        as well as all Rolling and Expanding primitives if numeric columns are present.

    Args:
        entityset (EntitySet): EntitySet that only contains one dataframe.
        include_time_series_primitives (bool): Whether or not time-series primitives should be considered. Defaults to False.
        excluded_primitives (List[str]): List of transform primitives to exclude from recommendations. Defaults to DEFAULT_EXCLUDED_PRIMITIVES.

    Returns:
        List[str]: Names of the recommended transform primitives, in no particular order.

    Note:
        The main objective of this function is to recommend primitives that could potentially provide important features to the modeling process.
        Non-numeric primitives mainly serve to extract information from original features that may be essentially meaningless by themselves (e.g., NaturalLanguage, Datetime, LatLong).
        That is why they are the main focus of this function. The usefulness of numeric transform primitives depends heavily on the individual case, so it is hard to quantify mathematically which should be recommended.
        Therefore, only transform primitives that address skewed numeric columns are included, as this is a standard and quantifiable transformation step. The only exception to this rule is
        for time series problems. Because there are so few primitives that are only applicable to time series, all of them are included in the recommended primitives list.

    Note:
        This function currently only works for single-table EntitySets and will only recommend transform primitives.
    """
    es_dataframe_list = entityset.dataframes
    if len(es_dataframe_list) == 0:
        raise IndexError("No DataFrame in EntitySet found. Please add a DataFrame.")
    if len(es_dataframe_list) > 1:
        raise IndexError(
            "Multi-table EntitySets are currently not supported. Please only use a single table EntitySet.",
        )

    target_dataframe_name = es_dataframe_list[0].ww.name

    recommended_primitives = set()

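    if not include_time_series_primitives:
        # Build a new list rather than using `+=`, which would mutate the
        # caller's list (or a shared default argument) in place.
        excluded_primitives = list(excluded_primitives) + list(TIME_SERIES_PRIMITIVES)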

    all_trans_primitives = get_transform_primitives()
    selected_trans_primitives = [
        p for name, p in all_trans_primitives.items() if name not in excluded_primitives
    ]

    valid_primitive_names = [
        prim.name
        for prim in get_valid_primitives(
            entityset,
            target_dataframe_name,
            1,
            selected_trans_primitives,
        )[1]
    ]

    recommended_primitives.update(
        _recommend_non_numeric_primitives(
            entityset,
            target_dataframe_name,
            valid_primitive_names,
        ),
    )

    recommended_primitives.update(
        _recommend_skew_numeric_primitives(
            entityset,
            target_dataframe_name,
            valid_primitive_names,
        ),
    )

    recommended_primitives.update(
        set(TIME_SERIES_PRIMITIVES).intersection(
            valid_primitive_names,
        ),
    )
    return list(recommended_primitives)


def _recommend_non_numeric_primitives(
    entityset: EntitySet,
    target_dataframe_name: str,
    valid_primitives: List[str],
) -> set:
    """Get a set of non-numeric primitives for a given dataset and a list of primitives.

    Description:
        Given a single table entity set with a `target_dataframe_name` and an applicable list of `valid_primitives`,
        get a set of primitives which produce non-unique features.

    Args:
        entityset (EntitySet): EntitySet that only contains one dataframe.
        target_dataframe_name (str): Name of target dataframe to access in `entityset`.
        valid_primitives (List[str]): List of primitives to calculate and check output features.
    """

    recommended_non_numeric_primitives: set[str] = set()
    # Only want to run feature generation on non numeric primitives
    numeric_columns_to_ignore = list(
        entityset[target_dataframe_name]
        .ww.select(include="numeric", return_schema=True)
        .columns,
    )
    features = dfs(
        entityset=entityset,
        target_dataframe_name=target_dataframe_name,
        trans_primitives=valid_primitives,
        max_depth=1,
        features_only=True,
        ignore_columns={target_dataframe_name: numeric_columns_to_ignore},
    )

    for f in features:
        if (
            f.primitive.name is not None
            and f.primitive.name not in recommended_non_numeric_primitives
        ):
            try:
                matrix = calculate_feature_matrix([f], entityset)
                for f_name in f.get_feature_names():
                    if len(matrix[f_name].unique()) > 1:
                        recommended_non_numeric_primitives.add(f.primitive.name)
            except (
                Exception
            ) as e:  # If error in calculating feature matrix pass on the recommendation
                logger = logging.getLogger("featuretools")
                logger.error(
                    f"Exception with feature {f.get_name()} with primitive {f.primitive.name}: {str(e)}",
                )

    return recommended_non_numeric_primitives


def _recommend_skew_numeric_primitives(
    entityset: EntitySet,
    target_dataframe_name: str,
    valid_primitives: List[str],
) -> set:
    """Get a set of recommended skew numeric primitives given an entity set.

    Description:
        Given a single table entity set with a `target_dataframe_name` and an applicable list of `valid_primitives`,
        get a set of primitives which could be applied to its `numeric` columns to address right skewness.

    Args:
        entityset (EntitySet): EntitySet that only contains one dataframe.
        target_dataframe_name (str): Name of target dataframe to access in `entityset`.
        valid_primitives (List[str]): List of primitives to compare.

    Note:
        We currently only have primitives to address right skewness.
    """
    recommended_skew_primitives: set[str] = set()
    skew_numeric_primitives = {"square_root", "natural_logarithm"}
    valid_skew_primitives = skew_numeric_primitives.intersection(valid_primitives)
    if valid_skew_primitives:
        numerics_only_df = entityset[target_dataframe_name].ww.select("numeric")
        for col in numerics_only_df:
            # Shouldn't recommend log or sqrt if NaNs, zeros, or negative numbers are present
            contains_nan = numerics_only_df[col].isnull().any()
            all_above_zero = (numerics_only_df[col] > 0).all()
            if all_above_zero and not contains_nan:
                skew = numerics_only_df[col].skew()
                # We currently don't have anything in featuretools to automatically handle left skewed data as well as skewed data with negative values
                if 0.5 < skew < 1 and "square_root" in valid_skew_primitives:
                    recommended_skew_primitives.add("square_root")
                    # TODO: Add Box Cox here when available
                if skew > 1 and "natural_logarithm" in valid_skew_primitives:
                    recommended_skew_primitives.add("natural_logarithm")
                    # TODO: Add log base 10 transform primitive when available
    return recommended_skew_primitives


================================================
FILE: featuretools/utils/s3_utils.py
================================================
import json
import shutil

from featuretools.utils.gen_utils import import_or_raise


def use_smartopen_es(file_path, path, transport_params=None, read=True):
    open = import_or_raise("smart_open", SMART_OPEN_ERR_MSG).open
    if read:
        with open(path, "rb", transport_params=transport_params) as fin:
            with open(file_path, "wb") as fout:
                shutil.copyfileobj(fin, fout)
    else:
        with open(file_path, "rb") as fin:
            with open(path, "wb", transport_params=transport_params) as fout:
                shutil.copyfileobj(fin, fout)


def use_smartopen_features(path, features_dict=None, transport_params=None, read=True):
    open = import_or_raise("smart_open", SMART_OPEN_ERR_MSG).open
    if read:
        with open(path, "r", encoding="utf-8", transport_params=transport_params) as f:
            features_dict = json.load(f)
            return features_dict
    else:
        with open(path, "w", transport_params=transport_params) as f:
            json.dump(features_dict, f)


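def get_transport_params(profile_name):
    """Build the ``transport_params`` passed to ``smart_open.open`` for S3 access.

    If ``profile_name`` is a string, an S3 client is created from that AWS profile.
    If ``profile_name`` is False, or no credentials can be found, an unsigned
    (anonymous) S3 client is created. Otherwise None is returned.
    """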
    boto3 = import_or_raise("boto3", BOTO3_ERR_MSG)
    UNSIGNED = import_or_raise("botocore", BOTOCORE_ERR_MSG).UNSIGNED
    Config = import_or_raise("botocore.config", BOTOCORE_ERR_MSG).Config

    if isinstance(profile_name, str):
        session = boto3.Session(profile_name=profile_name)
        transport_params = {"client": session.client("s3")}
    elif profile_name is False or boto3.Session().get_credentials() is None:
        session = boto3.Session()
        client = session.client("s3", config=Config(signature_version=UNSIGNED))
        transport_params = {"client": client}
    else:
        transport_params = None
    return transport_params


BOTO3_ERR_MSG = (
    "The boto3 library is required to read and write from URLs and S3.\n"
    "Install via pip:\n"
    "    pip install boto3\n"
    "Install via conda:\n"
    "    conda install -c conda-forge boto3"
)
BOTOCORE_ERR_MSG = (
    "The botocore library is required to read and write from URLs and S3.\n"
    "Install via pip:\n"
    "    pip install botocore\n"
    "Install via conda:\n"
    "    conda install -c conda-forge botocore"
)
SMART_OPEN_ERR_MSG = (
    "The smart_open library is required to read and write from URLs and S3.\n"
    "Install via pip:\n"
    "    pip install 'smart-open>=5.0.0'\n"
    "Install via conda:\n"
    "    conda install -c conda-forge 'smart_open>=5.0.0'"
)


================================================
FILE: featuretools/utils/schema_utils.py
================================================
import logging
import warnings

from packaging.version import parse

from featuretools.version import ENTITYSET_SCHEMA_VERSION, FEATURES_SCHEMA_VERSION

logger = logging.getLogger("featuretools.utils")


def check_schema_version(cls, cls_type):
    """
    If the saved schema version is newer than the current featuretools
    schema version, this function will output a warning saying so.

    If the saved schema version is a major release or more behind
    the current featuretools schema version, this function will log
    a message saying so.
    """
    if isinstance(cls_type, str):
        current = None
        saved = None
        if cls_type == "entityset":
            current = ENTITYSET_SCHEMA_VERSION
            saved = cls.get("schema_version")
        elif cls_type == "features":
            current = FEATURES_SCHEMA_VERSION
            saved = cls.features_dict["schema_version"]

        if parse(current) < parse(saved):
            warning_text_upgrade = (
                "The schema version of the saved %s"
                "(%s) is greater than the latest supported (%s). "
                "You may need to upgrade featuretools. Attempting to load %s ..."
                % (cls_type, saved, current, cls_type)
            )
            warnings.warn(warning_text_upgrade)

        if parse(current).major > parse(saved).major:
            warning_text_outdated = (
                "The schema version of the saved %s"
                "(%s) is no longer supported by this version "
                "of featuretools. Attempting to load %s ..."
                % (cls_type, saved, cls_type)
            )
            logger.warning(warning_text_outdated)


================================================
FILE: featuretools/utils/time_utils.py
================================================
from datetime import datetime, timedelta

import numpy as np
import pandas as pd


def make_temporal_cutoffs(
    instance_ids,
    cutoffs,
    window_size=None,
    num_windows=None,
    start=None,
):
    """Makes a set of equally spaced cutoff times prior to a set of input cutoffs and instance ids.

    If window_size and num_windows are provided, then num_windows of size window_size will be created
    prior to each cutoff time.

    If window_size and a start list are provided, then a variable number of windows will be created prior
    to each cutoff time, with the corresponding start time as the first cutoff.

    If num_windows and a start list are provided, then num_windows of variable size will be created prior
    to each cutoff time, with the corresponding start time as the first cutoff.

    Args:
        instance_ids (list, np.ndarray, or pd.Series): list of instance ids. This function will make a
            new datetime series of multiple cutoff times for each value in this array.
        cutoffs (list, np.ndarray, or pd.Series): list of datetime objects associated with each instance id.
            Each one of these will be the last time in the new datetime series for each instance id
        window_size (pd.Timedelta, optional): amount of time between each datetime in each new cutoff series
        num_windows (int, optional): number of windows in each new cutoff series
        start (list, optional): list of start times for each instance id
    """
    if window_size is not None and num_windows is not None and start is not None:
        raise ValueError(
            "Only supply 2 of the 3 optional args, window_size, num_windows and start",
        )
    out = []
    for i, id_time in enumerate(zip(instance_ids, cutoffs)):
        _id, time = id_time
        _window_size = window_size
        _start = None
        if start is not None:
            if window_size is None:
                _window_size = (time - start[i]) / (num_windows - 1)
            else:
                _start = start[i]
        to_add = pd.DataFrame()
        to_add["time"] = pd.date_range(
            end=time,
            periods=num_windows,
            freq=_window_size,
            start=_start,
        )
        to_add["instance_id"] = [_id] * len(to_add["time"])
        out.append(to_add)
    return pd.concat(out).reset_index(drop=True)


def convert_time_units(secs, unit):
    """
    Converts a time specified in seconds to a time in the given units

    Args:
        secs (integer): number of seconds. This function will convert the units of this number.
        unit (str): units to be converted to.
            acceptable values: years, months, days, hours, minutes, seconds, milliseconds, nanoseconds
    """
    unit_divs = {
        "years": 31540000,
        "months": 2628000,
        "days": 86400,
        "hours": 3600,
        "minutes": 60,
        "seconds": 1,
        "milliseconds": 0.001,
        "nanoseconds": 0.000000001,
    }
    if unit not in unit_divs:
        raise ValueError("Invalid unit given, make sure it is plural")

    return secs / (unit_divs[unit])


def convert_datetime_to_floats(x):
    first = int(x.iloc[0].value * 1e-9)
    x = pd.to_numeric(x).astype(np.float64).values
    dividend = find_dividend_by_unit(first)
    x *= 1e-9 / dividend
    return x


def convert_timedelta_to_floats(x):
    first = int(x.iloc[0].total_seconds())
    dividend = find_dividend_by_unit(first)
    x = pd.TimedeltaIndex(x).total_seconds().astype(np.float64) / dividend
    return x


def find_dividend_by_unit(time):
    """Finds whether time best corresponds to a value in
    days, hours, minutes, or seconds.
    """
    for dividend in [86400, 3600, 60]:
        div = time / dividend
        if round(div) == div:
            return dividend
    return 1


def calculate_trend(series):
    # numpy can't handle `Int64` values, so cast to float
    if series.dtype == "Int64":
        series = series.astype("float64")
    df = pd.DataFrame({"x": series.index, "y": series.values}).dropna()
    if df.shape[0] <= 2:
        return np.nan
    if isinstance(df["x"].iloc[0], (datetime, pd.Timestamp)):
        x = convert_datetime_to_floats(df["x"])
    else:
        x = df["x"].values

    if isinstance(df["y"].iloc[0], (datetime, pd.Timestamp)):
        y = convert_datetime_to_floats(df["y"])
    elif isinstance(df["y"].iloc[0], (timedelta, pd.Timedelta)):
        y = convert_timedelta_to_floats(df["y"])
    else:
        y = df["y"].values

    x = x - x.mean()
    y = y - y.mean()

    # prevent divide by zero error
    if len(np.unique(x)) == 1:
        return 0

    # consider scipy.stats.linregress for large n cases
    coefficients = np.polyfit(x, y, 1)
    return coefficients[0]


================================================
FILE: featuretools/utils/trie.py
================================================
class Trie(object):
    """
    A trie (prefix tree) where the keys are sequences of hashable objects.

    It behaves similarly to a dictionary, except that the keys can be lists or
    other sequences.

    Examples:
        >>> from featuretools.utils import Trie
        >>> trie = Trie(default=str)
        >>> # Set a value
        >>> trie.get_node([1, 2, 3]).value = '123'
        >>> # Get a value
        >>> trie.get_node([1, 2, 3]).value
        '123'
        >>> # Overwrite a value
        >>> trie.get_node([1, 2, 3]).value = 'updated'
        >>> trie.get_node([1, 2, 3]).value
        'updated'
        >>> # Getting a key that has not been set returns the default value.
        >>> trie.get_node([1, 2]).value
        ''
    """

    def __init__(self, default=lambda: None, path_constructor=list):
        """
        default: A function returning the value to use for new nodes.
        path_constructor: A function which constructs a path from a list. The
            path type must support addition (concatenation).
        """
        self.value = default()
        self._children = {}
        self._default = default
        self._path_constructor = path_constructor

    def children(self):
        """
        A list of pairs of the edges from this node and the nodes they point
        to.

        Examples:
            >>> from featuretools.utils import Trie
            >>> trie = Trie(default=str)
            >>> trie.get_node([1, 2]).value = '12'
            >>> trie.get_node([3]).value = '3'
            >>> children = trie.children()
            >>> first_edge, first_child = children[0]
            >>> first_edge
            1
            >>> first_child.value
            ''
            >>> second_edge, second_child = children[1]
            >>> second_edge
            3
            >>> second_child.value
            '3'
        """
        return list(self._children.items())

    def get_node(self, path):
        """
        Get the sub-trie at the given path. If it does not yet exist initialize
        it with the default value.

        Examples:
            >>> from featuretools.utils import Trie
            >>> t = Trie()
            >>> t.get_node([1, 2, 3]).value = '123'
            >>> t.get_node([1, 2, 4]).value = '124'
            >>> sub = t.get_node([1, 2])
            >>> sub.get_node([3]).value
            '123'
            >>> sub.get_node([4]).value
            '124'
        """
        if path:
            first = path[0]
            rest = path[1:]

            if first in self._children:
                sub_trie = self._children[first]
            else:
                sub_trie = Trie(
                    default=self._default,
                    path_constructor=self._path_constructor,
                )
                self._children[first] = sub_trie

            return sub_trie.get_node(rest)
        else:
            return self

    def __iter__(self):
        """
        Iterate over all values in the trie. Yields tuples of (path, value).

        Implemented using depth first search.
        """
        yield self._path_constructor([]), self.value

        for key, sub_trie in self.children():
            path_to_children = self._path_constructor([key])

            for sub_path, value in sub_trie:
                path = path_to_children + sub_path
                yield path, value


================================================
FILE: featuretools/utils/utils_info.py
================================================
import locale
import os
import platform
import struct
import sys

import pkg_resources

import featuretools

deps = [
    "numpy",
    "pandas",
    "tqdm",
    "cloudpickle",
    "dask",
    "distributed",
    "psutil",
    "pip",
    "setuptools",
]


def show_info():
    print("Featuretools version: %s" % featuretools.__version__)
    print("Featuretools installation directory: %s" % get_featuretools_root())
    print_sys_info()
    print_deps(deps)


def print_sys_info():
    print("\nSYSTEM INFO")
    print("-----------")
    sys_info = get_sys_info()
    for k, stat in sys_info:
        print("{k}: {stat}".format(k=k, stat=stat))


def print_deps(dependencies):
    print("\nINSTALLED VERSIONS")
    print("------------------")
    installed_packages = get_installed_packages()

    package_dep = []
    for x in dependencies:
        # prevents uninstalled deps from being printed
        if x in installed_packages:
            package_dep.append((x, installed_packages[x]))
    for k, stat in package_dep:
        print("{k}: {stat}".format(k=k, stat=stat))


# Modified from here
# https://github.com/pandas-dev/pandas/blob/d9a037ec4ad0aab0f5bf2ad18a30554c38299e57/pandas/util/_print_versions.py#L11
def get_sys_info():
    "Returns system information as a list of (key, value) tuples"

    blob = []

    try:
        (sysname, nodename, release, version, machine, processor) = platform.uname()
        blob.extend(
            [
                ("python", ".".join(map(str, sys.version_info))),
                ("python-bits", struct.calcsize("P") * 8),
                ("OS", "{sysname}".format(sysname=sysname)),
                ("OS-release", "{release}".format(release=release)),
                ("machine", "{machine}".format(machine=machine)),
                ("processor", "{processor}".format(processor=processor)),
                ("byteorder", "{byteorder}".format(byteorder=sys.byteorder)),
                ("LC_ALL", "{lc}".format(lc=os.environ.get("LC_ALL", "None"))),
                ("LANG", "{lang}".format(lang=os.environ.get("LANG", "None"))),
                ("LOCALE", ".".join(map(str, locale.getlocale()))),
            ],
        )
    except (KeyError, ValueError):
        pass

    return blob


def get_installed_packages():
    installed_packages = {}
    for d in pkg_resources.working_set:
        installed_packages[d.project_name] = d.version
    return installed_packages


def get_featuretools_root():
    return os.path.dirname(featuretools.__file__)


================================================
FILE: featuretools/utils/wrangle.py
================================================
import re
import tarfile
from datetime import datetime

import numpy as np
import pandas as pd
from woodwork.logical_types import Datetime, Ordinal

from featuretools.entityset.timedelta import Timedelta


def _check_timedelta(td):
    """
    Convert strings to Timedelta objects
    Allows for both shortform and longform units, as well as any form of capitalization
    '2 Minutes'
    '2 minutes'
    '2 m'
    '1 Minute'
    '1 minute'
    '1 m'
    '1 units'
    '1 Units'
    '1 u'
    Shortform is also accepted if the space is dropped
    '2m'
    '1u'
    If a pd.Timedelta object is passed, units will be converted to seconds due to the underlying representation
        of pd.Timedelta.
    If a pd.DateOffset object is passed, it will be converted to a Featuretools Timedelta if it has one
        temporal parameter. Otherwise, it will remain a pd.DateOffset.
    """
    if td is None:
        return td
    if isinstance(td, Timedelta):
        return td
    elif not isinstance(td, (int, float, str, pd.DateOffset, pd.Timedelta)):
        raise ValueError("Unable to parse timedelta: {}".format(td))
    if isinstance(td, pd.Timedelta):
        unit = "s"
        value = td.total_seconds()
        times = {unit: value}
        return Timedelta(times, delta_obj=td)
    elif isinstance(td, pd.DateOffset):
        # DateOffsets
        if td.__class__.__name__ != "DateOffset":
            if hasattr(td, "__dict__"):
                # Special offsets (such as BDay) - prior to pandas 1.0.0
                value = td.__dict__["n"]
            else:
                # Special offsets (such as BDay) - after pandas 1.0.0
                value = td.n
            unit = td.__class__.__name__
            times = dict([(unit, value)])
        else:
            times = dict()
            for td_unit, td_value in td.kwds.items():
                times[td_unit] = td_value
        return Timedelta(times, delta_obj=td)
    else:
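        pattern = "([0-9]+) *([a-zA-Z]+)$"
        match = re.match(pattern, td)
        if match is None:
            # A string that does not look like "<number> <unit>" cannot be parsed
            raise ValueError("Unable to parse timedelta: {}".format(td))
        value, unit = match.groups()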
        try:
            value = int(value)
        except Exception:
            try:
                value = float(value)
            except Exception:
                raise ValueError(
                    "Unable to parse value {} from ".format(value)
                    + "timedelta string: {}".format(td),
                )
        times = {unit: value}
        return Timedelta(times)


def _check_time_against_column(time, time_column):
    """
    Check to make sure that time is compatible with time_column,
    where time could be a timestamp, or a Timedelta, number, or None,
    and time_column is a Woodwork initialized column. Compatibility means that
    arithmetic can be performed between time and elements of time_column

    If time is None, then we don't care if arithmetic can be performed
    (presumably it won't ever be performed)
    """
    if time is None:
        return True
    elif isinstance(time, (int, float)):
        return time_column.ww.schema.is_numeric
    elif isinstance(time, (pd.Timestamp, datetime, pd.DateOffset)):
        return time_column.ww.schema.is_datetime
    elif isinstance(time, Timedelta):
        if time_column.ww.schema.is_datetime:
            return True
        elif time.unit not in Timedelta._time_units:
            if (
                isinstance(time_column.ww.logical_type, Ordinal)
                or "numeric" in time_column.ww.semantic_tags
                or "time_index" in time_column.ww.semantic_tags
            ):
                return True
    return False


def _check_time_type(time):
    """
    Checks if `time` is an instance of common int, float, or datetime types.
    Returns "numeric" or Datetime based on results
    """
    time_type = None
    if isinstance(time, (datetime, np.datetime64)):
        time_type = Datetime
    elif (
        isinstance(time, (int, float))
        or np.issubdtype(time, np.integer)
        or np.issubdtype(time, np.floating)
    ):
        time_type = "numeric"
    return time_type


def _is_s3(string):
    """
    Checks if the given string is an S3 path.
    Returns a boolean.
    """
    return string.startswith("s3://")


def _is_url(string):
    """
    Checks if the given string is a URL path.
    Returns a boolean.
    """
    return string.startswith("http")


def _is_local_tar(string):
    """
    Checks if the given string is a local tarfile path.
    Returns a boolean.
    """
    return string.endswith(".tar") and tarfile.is_tarfile(string)


================================================
FILE: featuretools/version.py
================================================
__version__ = "1.31.0"
ENTITYSET_SCHEMA_VERSION = "9.0.0"
FEATURES_SCHEMA_VERSION = "10.0.0"


================================================
FILE: pyproject.toml
================================================
[project]
name = "featuretools"
readme = "README.md"
description = "a framework for automated feature engineering"
dynamic = ["version"]
classifiers = [
    "Development Status :: 5 - Production/Stable",
    "Intended Audience :: Science/Research",
    "Intended Audience :: Developers",
    "Topic :: Software Development",
    "Topic :: Scientific/Engineering",
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Operating System :: Microsoft :: Windows",
    "Operating System :: POSIX",
    "Operating System :: Unix",
    "Operating System :: MacOS",
]
authors = [
    {name="Alteryx, Inc.", email="open_source_support@alteryx.com"}
]
maintainers = [
    {name="Alteryx, Inc.", email="open_source_support@alteryx.com"}
]
keywords = ["feature engineering", "data science", "machine learning"]
license = {text = "BSD 3-clause"}
requires-python = ">=3.9,<4"
dependencies = [
    "cloudpickle >= 1.5.0",
    "holidays >= 0.17",
    "numpy >= 1.25.0, < 2.0.0",
    "packaging >= 20.0",
    "pandas >= 2.0.0",
    "psutil >= 5.7.0",
    "scipy >= 1.10.0",
    "tqdm >= 4.66.3",
    "woodwork >= 0.28.0",
]

[project.urls]
"Documentation" = "https://featuretools.alteryx.com"
"Source Code"= "https://github.com/alteryx/featuretools/"
"Changes" = "https://featuretools.alteryx.com/en/latest/release_notes.html"
"Issue Tracker" = "https://github.com/alteryx/featuretools/issues"
"Twitter" = "https://twitter.com/alteryxoss"
"Chat" = "https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA"

[project.optional-dependencies]
test = [
    "boto3 >= 1.34.32",
    "composeml >= 0.8.0",
    "graphviz >= 0.8.4",
    "moto[all] >= 5.0.0",
    "pip >= 23.3.0",
    "pyarrow >= 14.0.1",
    "pympler >= 0.8",
    "pytest >= 7.1.2",
    "pytest-cov >= 3.0.0",
    "pytest-xdist >= 2.5.0",
    "smart-open >= 5.0.0",
    "urllib3 >= 1.26.18",
    "pytest-timeout >= 2.1.0",
]
dask = [
    "dask[dataframe] >= 2023.2.0",
    "distributed >= 2023.2.0",
]
tsfresh = [
    "featuretools-tsfresh-primitives >= 1.0.0",
]
autonormalize = [
    "autonormalize >= 2.0.1",
]
sql = [
    "featuretools_sql >= 0.0.1",
    "psycopg2-binary >= 2.9.3",
]
sklearn = [
    "featuretools-sklearn-transformer >= 1.0.0",
]
premium = [
    "premium-primitives >= 0.0.3",
]
nlp = [
    "nlp-primitives >= 2.12.0",
]
docs = [
    "ipython == 8.4.0",
    "jupyter == 1.0.0",
    "jupyter-client >= 8.0.2",
    "matplotlib == 3.7.2",
    "Sphinx == 5.1.1",
    "nbsphinx == 0.8.9",
    "nbconvert == 6.5.0",
    "pydata-sphinx-theme == 0.9.0",
    "sphinx-inline-tabs == 2022.1.2b11",
    "sphinx-copybutton == 0.5.0",
    "myst-parser == 0.18.0",
    "autonormalize >= 2.0.1",
    "click >= 7.0.0",
    "featuretools[dask,test]",
]
dev = [
    "ruff >= 0.1.6",
    "black[jupyter] >= 23.1.0",
    "pre-commit >= 2.20.0",
    "featuretools[docs,dask,test]",
]
complete = [
    "featuretools[premium,nlp,dask]",
]

[tool.setuptools]
include-package-data = true
license-files = [
    "LICENSE",
    "featuretools/primitives/data/free_email_provider_domains_license"
]

[tool.setuptools.packages.find]
namespaces = true

[tool.setuptools.package-data]
"*" = [
    "*.txt",
    "README.md",
]
"featuretools" = [
    "primitives/data/*.csv",
    "primitives/data/*.txt",
]

[tool.setuptools.exclude-package-data]
"*" = [
    "* __pycache__",
    "*.py[co]",
    "docs/*"
]

[tool.setuptools.dynamic]
version = {attr = "featuretools.version.__version__"}

[tool.pytest.ini_options]
addopts = "--doctest-modules --ignore=featuretools/tests/entry_point_tests/add-ons"
testpaths = [
    "featuretools/tests/*"
]
filterwarnings = [
    "ignore::DeprecationWarning",
    "ignore::PendingDeprecationWarning"
]

[tool.ruff]
line-length = 88
target-version = "py311"
lint.ignore = ["E501"]
lint.select = [
    # Pyflakes
    "F",
    # Pycodestyle
    "E",
    "W",
    # isort
    "I001"
]
src = ["featuretools"]

[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["E402", "F401", "I001", "E501"]

[tool.ruff.lint.isort]
known-first-party = ["featuretools"]

[tool.coverage.run]
source = ["featuretools"]
omit = [
    "*/add-ons/**/*"
]

[tool.coverage.report]
exclude_lines =[
    "pragma: no cover",
    "def __repr__",
    "raise AssertionError",
    "raise NotImplementedError",
    "if __name__ == .__main__.:",
    "if self._verbose:",
    "if verbose:",
    "if profile:",
    "pytest.skip"
]
[build-system]
requires = [
    "setuptools >= 61.0.0",
    "wheel"
]
build-backend = "setuptools.build_meta"


================================================
FILE: release.md
================================================
# Release Process

## 0. Pre-Release Checklist

Before starting the release process, verify the following:

- All work required for this release has been completed and the team is ready to release.
- [All Github Actions Tests are green on main](https://github.com/alteryx/featuretools/actions?query=branch%3Amain).
- EvalML Tests are green with Featuretools main
  - [![Unit Tests - EvalML with Featuretools main branch](https://github.com/alteryx/evalml/actions/workflows/unit_tests_with_featuretools_main_branch.yaml/badge.svg?branch=main)](https://github.com/alteryx/evalml/actions/workflows/unit_tests_with_featuretools_main_branch.yaml)
- Looking Glass performance test runs should not show any significant performance regressions when comparing the last commit on `main` with the previous release of Featuretools. See Step 1 below for instructions on manually launching the performance test runs.
- The [ReadtheDocs build](https://readthedocs.com/projects/feature-labs-inc-featuretools/) for "latest" is marked as passed. To avoid mysterious errors, best practice is to empty your browser cache when reading new versions of the docs!
- The [public documentation for the "latest" branch](https://featuretools.alteryx.com/en/latest/) looks correct, and the [release notes](https://featuretools.alteryx.com/en/latest/release_notes.html) include the last change made on `main`.
- Get agreement on the version number to use for the release.

#### Version Numbering

Featuretools uses [semantic versioning](https://semver.org/). Every release has a major, minor and patch version number, displayed like so: `<majorVersion>.<minorVersion>.<patchVersion>`.

In certain instances, it may be necessary to create a backport release. A backport release ports commits from a newer version of the library to an older version and then releases the result. Any release that targets something other than the latest commit on `main` is a backport; this can go as far as publishing a further patch release, such as 0.11.2, after a 0.12.0 version has already been released. If a backport release is being performed, please see the [Backport Release Guide](docs/backport_release.md) for instructions on how to proceed, as some steps from this guide should be performed differently.

If you'd like to create a development release, which is not deployed to PyPI or conda and is not marked as a generally available production release, please add a "dev" prefix to the patch version, i.e. `X.X.devX`. Note that this claims the patch number: if the previous release was `0.12.0`, a subsequent dev release would be `0.12.dev1`, and the following release would be `0.12.2`, _not_ `0.12.1`. Development releases deploy to [test.pypi.org](https://test.pypi.org/project/featuretools/) instead of to [pypi.org](https://pypi.org/project/featuretools).
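
The numbering rules above can be sketched in Python. This is only an illustration: the regular expressions and the helper names `classify_version` and `next_patch_after_dev` are assumptions inferred from the examples in this guide, not part of the Featuretools codebase or its release tooling.

```python
import re

# Assumed patterns, inferred from the examples in this guide:
#   production release:  X.Y.Z     (e.g. 0.12.0)
#   development release: X.Y.devN  (e.g. 0.12.dev1)
RELEASE_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")
DEV_RE = re.compile(r"(\d+)\.(\d+)\.dev(\d+)")


def classify_version(version: str) -> str:
    """Return "release", "dev", or "invalid" for a version string."""
    if RELEASE_RE.fullmatch(version):
        return "release"
    if DEV_RE.fullmatch(version):
        return "dev"
    return "invalid"


def next_patch_after_dev(dev_version: str) -> str:
    """Return the production version that follows a dev release.

    The dev release claims its patch number, so 0.12.dev1 is followed
    by 0.12.2 rather than 0.12.1.
    """
    match = DEV_RE.fullmatch(dev_version)
    if match is None:
        raise ValueError(f"not a dev version: {dev_version!r}")
    major, minor, dev_number = match.groups()
    return f"{major}.{minor}.{int(dev_number) + 1}"
```

For example, `next_patch_after_dev("0.12.dev1")` returns `"0.12.2"`, matching the case described above.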

## 1. Evaluate Performance Test Results

Before releasing Featuretools, the person performing the release should launch a performance test run and evaluate the results to make sure no significant performance regressions will be introduced by the release. This can be done by launching a Looking Glass performance test run, which will then post results to Slack. 

To manually launch a Looking Glass performance test run, follow these steps:
1. Navigate to the [Looking Glass performance tests](https://github.com/alteryx/featuretools/actions/workflows/looking_glass_performance_tests.yaml) GitHub action
2. Click on the Run workflow dropdown to set up the run
3. Make sure that the "use workflow from" dropdown is set to `main` to use the workflow version in Featuretools `main`
4. Enter the hash of the most recent commit to `main` in the "new commit to evaluate" field. For example: `cee9607`
5. Enter the version tag of the last release of Featuretools in the "previous commit to evaluate" field. For example, if the last release of Featuretools was version 1.20.0, you would enter `v1.20.0` here.
6. Click the "Run workflow" button to launch the jobs

Once the job has been completed, the results summaries will be posted to Slack automatically. Review the results and make sure the performance has not degraded. If any significant performance issues are noted, discuss with the development team before proceeding.

Note: The procedure above can also be used to launch performance test runs at any time, even outside of the release process. When launching a test run, the commit fields accept any commit hash, GitHub branch or tag as input to specify the new and previous commits to compare.

## 2. Create Featuretools release on Github

#### Create Release Branch

1. Branch off of featuretools `main`. For the branch name, please use "release_vX.Y.Z" as the naming scheme (e.g. "release_v0.13.3"). Doing so bypasses the release notes check, which requires all other PRs to add a release note entry.
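
A quick way to double-check a branch name against this scheme before pushing is sketched below. The pattern is an assumption based on the "release_vX.Y.Z" scheme described here; the actual release notes check in CI may match branch names differently, and both helper names are hypothetical.

```python
import re

# Assumed from the "release_vX.Y.Z" naming scheme described above.
RELEASE_BRANCH_RE = re.compile(r"release_v\d+\.\d+\.\d+")


def release_branch_name(version: str) -> str:
    """Build the release branch name for a version such as "0.13.3"."""
    return f"release_v{version}"


def is_release_branch(name: str) -> bool:
    """Return True if the branch name follows the release naming scheme."""
    return RELEASE_BRANCH_RE.fullmatch(name) is not None
```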

#### Bump Version Number

1. Bump `__version__` in `featuretools/version.py` and `featuretools/tests/test_version.py`.
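
If you prefer to script the bump, a minimal sketch follows. It assumes that both files contain the old version as a quoted string literal (for example `__version__ = "1.30.0"`), which is an assumption about their contents; `bump_version` is a hypothetical helper that works on text you have already read from the file.

```python
import re


def bump_version(source: str, old: str, new: str) -> str:
    """Replace every quoted occurrence of `old` in `source` with `new`."""
    # Match the old version only when it is wrapped in matching quotes,
    # so that unrelated numbers in the file are left alone.
    pattern = re.compile("([\"'])" + re.escape(old) + r"\1")
    updated, count = pattern.subn(
        lambda m: m.group(1) + new + m.group(1), source
    )
    if count == 0:
        # Fail loudly instead of silently leaving the file unchanged.
        raise ValueError(f"version {old!r} not found")
    return updated
```

Raising when nothing matches guards against accidentally skipping one of the two files.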

#### Update Release Notes

1. Replace "Future Release" in `docs/source/release_notes.rst` with the version number and the current date

   ```
   v0.13.3 Sep 28, 2020
   ====================
   ```

2. Remove any unused Release Notes sections for this release (e.g. Fixes, Testing Changes)
3. Add yourself to the list of contributors to this release and **put the contributors in alphabetical order**
4. The release PR does not need to be mentioned in the list of changes
5. Add a commented out "Future Release" section with all of the Release Notes sections above the current section

   ```
   .. Future Release
     ==============
       * Enhancements
       * Fixes
       * Changes
       * Documentation Changes
       * Testing Changes

   .. Thanks to the following people for contributing to this release:
   ```
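
The dated heading from step 1 can be generated so that the underline always matches the title length, which reStructuredText requires. The sketch below is illustrative only: `release_header` is a hypothetical helper, and the date format (abbreviated month, unpadded day) is inferred from the single example shown above.

```python
from datetime import date

# Hard-coded month abbreviations keep the output independent of locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def release_header(version: str, released: date) -> str:
    """Return the release notes heading and its matching underline."""
    month = MONTHS[released.month - 1]
    title = f"v{version} {month} {released.day}, {released.year}"
    # The underline must be at least as long as the title text.
    return f"{title}\n{'=' * len(title)}"


print(release_header("0.13.3", date(2020, 9, 28)))
```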

#### Create Release PR

A [release PR](https://github.com/alteryx/featuretools/pull/856) should have **the version number as the title** and the release notes for that release as the PR body text. The contributors list is not necessary. The special sphinx docs syntax (:pr:\`547\`) needs to be changed to GitHub link syntax (#547).
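
The syntax conversion can be done with two regular expression substitutions. The sketch below handles the pull request role shown above and also the user role (:user:\`gsheni\` to @gsheni) that is needed when drafting the GitHub release in step 3. `to_github_syntax` is a hypothetical helper, and it assumes the release notes use only these two roles.

```python
import re

# Sphinx roles as they appear in docs/source/release_notes.rst.
PR_ROLE = re.compile(r":pr:`(\d+)`")
USER_ROLE = re.compile(r":user:`([^`]+)`")


def to_github_syntax(notes: str) -> str:
    """Convert :pr: and :user: roles to GitHub #number and @name syntax."""
    notes = PR_ROLE.sub(r"#\1", notes)
    return USER_ROLE.sub(r"@\1", notes)
```

Paste the relevant release notes section into a string, pass it through the function, and use the result as the PR body or release description.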

Checklist before merging:

- The title of the PR is the version number.
- All tests are currently green on checkin and on `main`.
- The ReadtheDocs build for the release PR branch has passed, and the resulting docs contain the expected release notes.
- PR has been reviewed and approved.
- Confirm with the team that `main` will be frozen until step 3 (Github Release) is complete.

After merging, verify again that ReadtheDocs "latest" is correct.

## 3. Create Github Release

After the release pull request has been merged into the `main` branch, it is time to draft the GitHub release. [Example release](https://github.com/alteryx/featuretools/releases/tag/v0.13.3)

- The target should be the `main` branch
- The tag should be the version number with a v prefix (e.g. v0.13.3)
- Release title is the same as the tag
- Release description should be the full Release Notes updates for the release, including the line thanking contributors. Contributors should also have their links changed from the docs syntax (:user:\`gsheni\`) to GitHub syntax (@gsheni)
- This is not a pre-release
- Publishing the release will automatically upload the package to PyPI

## 4. Release on conda-forge

To release on conda-forge, you can either use a GitHub Actions workflow or wait for a bot to create a pull request.

### Option a: Use a GitHub Action workflow

1. After the package has been uploaded to PyPI, the **Create Feedstock Pull Request** workflow should automatically kick off a job.
    * If it does not, go [here](https://github.com/alteryx/featuretools/actions/workflows/create_feedstock_pr.yaml)
    * Click **Run workflow** and input the letter `v` followed by the release version (e.g. `v0.13.3`)
    * Kick off the GitHub Action, and monitor the Job Summary.
2. Once the job has been completed, you will see summary output, with a URL. 
    * Visit that URL and create a pull request.
    * Alternatively, create the pull request by clicking the branch name (e.g. - `v0.13.3`): 
      - https://github.com/alteryx/featuretools-feedstock/branches
3. Verify that the PR has the following: 
    * The `build['number']` is 0 (in __recipe/meta.yaml__).
    * The `requirements['run']` (in __recipe/meta.yaml__) matches the `[project]['dependencies']` in __featuretools/pyproject.toml__.
    * The `test['requires']` (in __recipe/meta.yaml__) matches the `[project.optional-dependencies]['test']` in __featuretools/pyproject.toml__.
    > There will be 2 entries for graphviz: `graphviz` and `python-graphviz`.
    > Make sure `python-graphviz` (in __recipe/meta.yaml__) matches `graphviz` in `[project.optional-dependencies]['test']` in __featuretools/pyproject.toml__.
4. Satisfy the conditions in pull request description and **merge it if the CI passes**. 

### Option b: Waiting for bot to create new PR

1. A bot should automatically create a new PR in [conda-forge/featuretools-feedstock](https://github.com/conda-forge/featuretools-feedstock/pulls) - note, the PR may take up to a few hours to be created
2. Apply any requirements changes in `recipe/meta.yaml` (the bot should have handled the version and source links on its own)
3. After tests pass, a maintainer will merge the PR in

# Miscellaneous
## Add new maintainers to featuretools-feedstock

Per the instructions [here](https://conda-forge.org/docs/maintainer/updating_pkgs.html#updating-the-maintainer-list):
1. Ask an existing maintainer to create an issue on the [repo](https://github.com/conda-forge/featuretools-feedstock).
  a. Select *Bot commands* and put the following title (change `username`):

  ```text
  @conda-forge-admin, please add user @username
  ```

2. A PR will be auto-created on the repo, and will need to be merged by an existing maintainer.
3. The new user will need to **check their email for an invite link to click**, which should point to https://github.com/conda-forge