Set train-test-split shuffle=False as default and remove stratification

2022-09-28 18:23:56 +02:00
parent fb3d408338
commit 683b084323
3 changed files with 4 additions and 28 deletions
@@ -27,8 +27,7 @@ Mandatory parameters are marked as **Required** and have to be set in one of the
 | `weight_factor` | Weight training data points according to their recency (see details [here](freqai-feature-engineering.md#weighting-features-for-temporal-importance)). <br> **Datatype:** Positive float (typically < 1).
 | `indicator_max_period_candles` | **No longer used (#7325)**. Replaced by `startup_candle_count` which is set in the [strategy](freqai-configuration.md#building-a-freqai-strategy). `startup_candle_count` is timeframe independent and defines the maximum *period* used in `populate_any_indicators()` for indicator creation. `FreqAI` uses this parameter together with the maximum timeframe in `include_time_frames` to calculate how many data points to download such that the first data point does not include a NaN <br> **Datatype:** Positive integer.
 | `indicator_periods_candles` | Time periods to calculate indicators for. The indicators are added to the base indicator dataset. <br> **Datatype:** List of positive integers.
-| `stratify_training_data` | Split the feature set into training and testing datasets. For example, `stratify_training_data: 2` would set every 2nd data point into a separate dataset to be pulled from during training/testing. See details about how it works [here](freqai-running.md#data-stratification-for-training-and-testing-the-model). <br> **Datatype:** Positive integer.
+| `principal_component_analysis` | Automatically reduce the dimensionality of the data set using Principal Component Analysis. See details about how it works [here](#reducing-data-dimensionality-with-principal-component-analysis) <br> **Datatype:** Boolean. defaults to `False`.
 | `principal_component_analysis` | Automatically reduce the dimensionality of the data set using Principal Component Analysis. See details about how it works [here](#reducing-data-dimensionality-with-principal-component-analysis) <br> **Datatype:** Boolean. defaults to `false`.
 | `plot_feature_importances` | Create a feature importance plot for each model for the top/bottom `plot_feature_importances` number of features.<br> **Datatype:** Integer, defaults to `0`.
 | `DI_threshold` | Activates the use of the Dissimilarity Index for outlier detection when set to > 0. See details about how it works [here](freqai-feature-engineering.md#identifying-outliers-with-the-dissimilarity-index-di). <br> **Datatype:** Positive float (typically < 1).
 | `use_SVM_to_remove_outliers` | Train a support vector machine to detect and remove outliers from the training dataset, as well as from incoming data points. See details about how it works [here](freqai-feature-engineering.md#identifying-outliers-using-a-support-vector-machine-svm). <br> **Datatype:** Boolean.
@@ -105,23 +105,6 @@ During dry/live mode, FreqAI trains each coin pair sequentially (on separate thr
 In the presented example config, the user will only allow predictions on models that are less than 1/2 hours old.
 ## Data stratification for training and testing the model
 You can stratify (group) the training/testing data using:
 ```json
    "freqai": {
        "feature_parameters" : {
            "stratify_training_data": 3
        }
    }
 ```
 This will split the data chronologically so that every Xth data point is used to test the model after training. In the example above, the user is asking for every third data point in the dataframe to be used for
 testing; the other points are used for training.
 The test data is used to evaluate the performance of the model after training. If the test score is high, the model is able to capture the behavior of the data well. If the test score is low, either the model does not capture the complexity of the data, the test data is significantly different from the train data, or a different type of model should be used.
 ## Controlling the model learning process
 Model training parameters are unique to the selected machine learning library. FreqAI allows you to set any parameter for any library using the `model_training_parameters` dictionary in the config. The example config (found in `config_examples/config_freqai.example.json`) shows some of the example parameters associated with `Catboost` and `LightGBM`, but you can add any parameters available in those libraries or any other machine learning library you choose to implement.
@@ -134,20 +134,14 @@ class FreqaiDataKitchen:
        """
        feat_dict = self.freqai_config["feature_parameters"]
        shuffle = self.freqai_config.get('data_split_parameters', {}).get('shuffle', False)
        weights: npt.ArrayLike
        if feat_dict.get("weight_factor", 0) > 0:
            weights = self.set_weights_higher_recent(len(filtered_dataframe))
        else:
            weights = np.ones(len(filtered_dataframe))
        if feat_dict.get("stratify_training_data", 0) > 0:
            stratification = np.zeros(len(filtered_dataframe))
            for i in range(1, len(stratification)):
                if i % feat_dict.get("stratify_training_data", 0) == 0:
                    stratification[i] = 1
        else:
            stratification = None
        if self.freqai_config.get('data_split_parameters', {}).get('test_size', 0.1) != 0:
            (
                train_features,
@@ -160,7 +154,7 @@ class FreqaiDataKitchen:
                filtered_dataframe[: filtered_dataframe.shape[0]],
                labels,
                weights,
-                stratify=stratification,
+                shuffle=shuffle,
                **self.config["freqai"]["data_split_parameters"],
            )
        else: