Merge pull request #7495 from th0rntwig/train-test-shuffle

Set train-test-split parameters shuffle=False as default and remove stratification
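
For illustration, the two behaviours this change relies on can be sketched with the standard library alone. This is a minimal sketch, not the FreqAI implementation: `with_default_shuffle` and `chronological_split` are hypothetical helpers, and the split arithmetic (round the test count up, give the rest to training) is a simplifying assumption about how an unshuffled split divides the rows.

```python
import math


def with_default_shuffle(data_split_parameters):
    """Return a copy of the split parameters with `shuffle` defaulting to False.

    Hypothetical helper: it mirrors the idea of only setting the default
    when the user has not provided a value.
    """
    params = dict(data_split_parameters)
    params.setdefault("shuffle", False)
    return params


def chronological_split(rows, test_size):
    """Split without shuffling: the most recent rows become the test set.

    Assumption: the test count is rounded up and the remainder is used
    for training.
    """
    n_test = math.ceil(test_size * len(rows))
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


# The user only set `test_size`, so `shuffle` falls back to False.
params = with_default_shuffle({"test_size": 0.25})

# Eight candles in time order, oldest first.
candles = list(range(8))
train, test = chronological_split(candles, params["test_size"])

print(params)  # {'test_size': 0.25, 'shuffle': False}
print(train)   # [0, 1, 2, 3, 4, 5]
print(test)    # [6, 7]
```

With `shuffle` left at `False`, the test set is always the newest slice of the data, so the model is evaluated on points that come after everything it was trained on. A user who explicitly sets `"shuffle": true` keeps that value.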
2022-10-01 14:52:14 +02:00
parent 6702a1b219 51556e08c3
commit 84b822dbf1
5 changed files with 7 additions and 30 deletions
@@ -27,8 +27,7 @@ Mandatory parameters are marked as **Required** and have to be set in one of the
 | `weight_factor` | Weight training data points according to their recency (see details [here](freqai-feature-engineering.md#weighting-features-for-temporal-importance)). <br> **Datatype:** Positive float (typically < 1).
 | `indicator_max_period_candles` | **No longer used (#7325)**. Replaced by `startup_candle_count` which is set in the [strategy](freqai-configuration.md#building-a-freqai-strategy). `startup_candle_count` is timeframe independent and defines the maximum *period* used in `populate_any_indicators()` for indicator creation. `FreqAI` uses this parameter together with the maximum timeframe in `include_time_frames` to calculate how many data points to download such that the first data point does not include a NaN <br> **Datatype:** Positive integer.
 | `indicator_periods_candles` | Time periods to calculate indicators for. The indicators are added to the base indicator dataset. <br> **Datatype:** List of positive integers.
-| `stratify_training_data` | Split the feature set into training and testing datasets. For example, `stratify_training_data: 2` would set every 2nd data point into a separate dataset to be pulled from during training/testing. See details about how it works [here](freqai-running.md#data-stratification-for-training-and-testing-the-model). <br> **Datatype:** Positive integer.
-| `principal_component_analysis` | Automatically reduce the dimensionality of the data set using Principal Component Analysis. See details about how it works [here](#reducing-data-dimensionality-with-principal-component-analysis) <br> **Datatype:** Boolean. defaults to `false`.
+| `principal_component_analysis` | Automatically reduce the dimensionality of the data set using Principal Component Analysis. See details about how it works [here](#reducing-data-dimensionality-with-principal-component-analysis) <br> **Datatype:** Boolean. defaults to `False`.
 | `plot_feature_importances` | Create a feature importance plot for each model for the top/bottom `plot_feature_importances` number of features.<br> **Datatype:** Integer, defaults to `0`.
 | `DI_threshold` | Activates the use of the Dissimilarity Index for outlier detection when set to > 0. See details about how it works [here](freqai-feature-engineering.md#identifying-outliers-with-the-dissimilarity-index-di). <br> **Datatype:** Positive float (typically < 1).
 | `use_SVM_to_remove_outliers` | Train a support vector machine to detect and remove outliers from the training dataset, as well as from incoming data points. See details about how it works [here](freqai-feature-engineering.md#identifying-outliers-using-a-support-vector-machine-svm). <br> **Datatype:** Boolean.
@@ -41,7 +40,7 @@ Mandatory parameters are marked as **Required** and have to be set in one of the
 |  |  **Data split parameters**
 | `data_split_parameters` | Include any additional parameters available from Scikit-learn `train_test_split()`, which are shown [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) (external website). <br> **Datatype:** Dictionary.
 | `test_size` | The fraction of data that should be used for testing instead of training. <br> **Datatype:** Positive float < 1.
-| `shuffle` | Shuffle the training data points during training. Typically, for time-series forecasting, this is set to `False`. <br> **Datatype:** Boolean.
+| `shuffle` | Shuffle the training data points during training. For time-series forecasting, this is typically set to `False` to preserve the chronological order of the data. <br> **Datatype:** Boolean. <br> Default: `False`.
 |  |  **Model training parameters**
 | `model_training_parameters` | A flexible dictionary that includes all parameters available by the selected model library. For example, if you use `LightGBMRegressor`, this dictionary can contain any parameter available by the `LightGBMRegressor` [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html) (external website). If you select a different model, this dictionary can contain any parameter from that model.  <br> **Datatype:** Dictionary.
 | `n_estimators` | The number of boosted trees to fit in regression. <br> **Datatype:** Integer.
@@ -105,23 +105,6 @@ During dry/live mode, FreqAI trains each coin pair sequentially (on separate thr

 In the presented example config, the user will only allow predictions on models that are less than 1/2 hours old.

-## Data stratification for training and testing the model
-
-You can stratify (group) the training/testing data using:
-
-```json
-    "freqai": {
-        "feature_parameters" : {
-            "stratify_training_data": 3
-        }
-    }
-```
-
-This will split the data chronologically so that every Xth data point is used to test the model after training. In the example above, the user is asking for every third data point in the dataframe to be used for
-testing; the other points are used for training.
-
-The test data is used to evaluate the performance of the model after training. If the test score is high, the model is able to capture the behavior of the data well. If the test score is low, either the model does not capture the complexity of the data, the test data is significantly different from the train data, or a different type of model should be used.
-
 ## Controlling the model learning process

 Model training parameters are unique to the selected machine learning library. FreqAI allows you to set any parameter for any library using the `model_training_parameters` dictionary in the config. The example config (found in `config_examples/config_freqai.example.json`) shows some of the example parameters associated with `Catboost` and `LightGBM`, but you can add any parameters available in those libraries or any other machine learning library you choose to implement.
@@ -567,6 +567,7 @@ CONF_SCHEMA = {
                    "properties": {
                        "test_size": {"type": "number"},
                        "random_state": {"type": "integer"},
+                        "shuffle": {"type": "boolean", "default": False}
                    },
                },
                "model_training_parameters": {
@@ -134,20 +134,15 @@ class FreqaiDataKitchen:
        """
        feat_dict = self.freqai_config["feature_parameters"]

+        if 'shuffle' not in self.freqai_config['data_split_parameters']:
+            self.freqai_config["data_split_parameters"].update({'shuffle': False})
+
        weights: npt.ArrayLike
        if feat_dict.get("weight_factor", 0) > 0:
            weights = self.set_weights_higher_recent(len(filtered_dataframe))
        else:
            weights = np.ones(len(filtered_dataframe))

-        if feat_dict.get("stratify_training_data", 0) > 0:
-            stratification = np.zeros(len(filtered_dataframe))
-            for i in range(1, len(stratification)):
-                if i % feat_dict.get("stratify_training_data", 0) == 0:
-                    stratification[i] = 1
-        else:
-            stratification = None
-
        if self.freqai_config.get('data_split_parameters', {}).get('test_size', 0.1) != 0:
            (
                train_features,
@@ -160,7 +155,6 @@ class FreqaiDataKitchen:
                filtered_dataframe[: filtered_dataframe.shape[0]],
                labels,
                weights,
-                stratify=stratification,
                **self.config["freqai"]["data_split_parameters"],
            )
        else:
@@ -86,7 +86,7 @@ def test_use_SVM_to_remove_outliers_and_outlier_protection(mocker, freqai_conf,
    freqai_conf['freqai']['feature_parameters'].update({"outlier_protection_percentage": 0.1})
    freqai.dk.use_SVM_to_remove_outliers(predict=False)
    assert log_has_re(
-        "SVM detected 8.09%",
+        "SVM detected 8.66%",
        caplog,
    )