def make_pipeline(*steps, **kwargs):
    """Construct a Pipeline from the given estimators.

    This is a shorthand for the Pipeline constructor; it does not require,
    and does not permit, naming the estimators. Instead, their names will
    be set to the lowercase of their types automatically.

    Parameters
    ----------
    *steps : list of estimators

    memory : None, str or object with the joblib.Memory interface, optional
        Used to cache the fitted transformers of the pipeline. By default,
        no caching is performed. If a string is given, it is the path to
        the caching directory. Enabling caching triggers a clone of the
        transformers before fitting. Therefore, the transformer instance
        given to the pipeline cannot be inspected directly. Use the
        attribute ``named_steps`` or ``steps`` to inspect estimators within
        the pipeline. Caching the transformers is advantageous when fitting
        is time consuming.
    """
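A minimal usage sketch of the automatic naming behaviour (editor's
illustration using the stock StandardScaler and GaussianNB estimators; not
part of the upstream source):

from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), GaussianNB())
# Step names are the lowercased class names, assigned automatically.
print([name for name, _ in pipe.steps])   # ['standardscaler', 'gaussiannb']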
class RobustScaler(BaseEstimator, TransformerMixin):
    """Scale features using statistics that are robust to outliers.

    This Scaler removes the median and scales the data according to the
    quantile range (defaults to IQR: Interquartile Range). The IQR is the
    range between the 1st quartile (25th quantile) and the 3rd quartile
    (75th quantile). Centering and scaling happen independently on each
    feature (or each sample, depending on the ``axis`` argument) by
    computing the relevant statistics on the samples in the training set.
    The median and interquartile range are then stored to be used on later
    data using the ``transform`` method.

    Standardization of a dataset is a common requirement for many machine
    learning estimators. Typically this is done by removing the mean and
    scaling to unit variance. However, outliers can often influence the
    sample mean / variance in a negative way. In such cases, the median and
    the interquartile range often give better results.
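A quick numeric sketch of this point (editor's illustration, assuming only
NumPy): one outlier dominates the mean and standard deviation but barely
moves the median and the IQR.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])     # 100.0 is an outlier
print(np.mean(x), np.std(x))                  # 22.0 39.01... (distorted)
print(np.median(x))                           # 3.0 (robust)
q1, q3 = np.percentile(x, [25.0, 75.0])
print(q3 - q1)                                # 2.0, the IQR (robust)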
    .. versionadded:: 0.17

    Read more in the :ref:`User Guide `.
    Parameters
    ----------
    with_centering : boolean, True by default
        If True, center the data before scaling. This will cause
        ``transform`` to raise an exception when attempted on sparse
        matrices, because centering them entails building a dense matrix
        which in common use cases is likely to be too large to fit in
        memory.

    with_scaling : boolean, True by default
        If True, scale the data to interquartile range.

    quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0
        Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR
        Quantile range used to calculate ``scale_``.

        .. versionadded:: 0.18

    copy : boolean, optional, default is True
        If False, try to avoid a copy and do inplace scaling instead.
        This is not guaranteed to always work inplace; e.g. if the data is
        not a NumPy array or scipy.sparse CSR matrix, a copy may still be
        returned.

    Attributes
    ----------
    center_ : array of floats
        The median value for each feature in the training set.

    scale_ : array of floats
        The (scaled) interquartile range for each feature in the training
        set.

        .. versionadded:: 0.17
           *scale_* attribute.
    See also
    --------
    robust_scale : Equivalent function without the estimator API.

    :class:`sklearn.decomposition.PCA`
        Further removes the linear correlation across features with
        'whiten=True'.

    Notes
    -----
    For a comparison of the different scalers, transformers, and
    normalizers, see :ref:`examples/preprocessing/plot_all_scaling.py `.
    """
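A usage sketch (editor's illustration; values follow from the default
quantile_range=(25.0, 75.0)): fit stores the per-feature median and IQR,
and transform reuses them on new data.

import numpy as np
from sklearn.preprocessing import RobustScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
scaler = RobustScaler().fit(X_train)
print(scaler.center_)             # [ 3.]  per-feature median
print(scaler.scale_)              # [ 2.]  per-feature IQR
print(scaler.transform([[5.0]]))  # [[ 1.]]  i.e. (5 - 3) / 2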
    def _check_array(self, X, copy):
        """Makes sure centering is not enabled for sparse matrices."""
        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
                        estimator=self, dtype=FLOAT_DTYPES)

        if sparse.issparse(X):
            if self.with_centering:
                raise ValueError(
                    "Cannot center sparse matrices: use `with_centering=False`"
                    " instead. See docstring for motivation and alternatives.")
        return X
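An editor's sketch of the sparse-input rules as this excerpt defines them
(fit below refuses sparse input outright, and transform refuses it when
centering is enabled; later scikit-learn releases may relax this):

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import RobustScaler

X = np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])
scaler = RobustScaler(with_centering=False).fit(X)  # fit on dense data
scaler.transform(sp.csr_matrix(X))                  # sparse scaling works
# RobustScaler().fit(sp.csr_matrix(X))  # per this excerpt: TypeError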
    def fit(self, X, y=None):
        """Compute the median and quantiles to be used for scaling.

        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data used to compute the median and quantiles used for
            later scaling along the features axis.
        """
        if sparse.issparse(X):
            raise TypeError("RobustScaler cannot be fitted on sparse inputs")
        X = self._check_array(X, self.copy)
        if self.with_centering:
            self.center_ = np.median(X, axis=0)
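        # --- Editor's sketch of the truncated remainder of fit (not
        # verbatim upstream code): if scaling is enabled, the per-feature
        # quantile range becomes ``scale_``; fit then returns self, per
        # the usual scikit-learn convention.
        if self.with_scaling:
            q = np.percentile(X, self.quantile_range, axis=0)
            self.scale_ = q[1] - q[0]   # e.g. 75th minus 25th percentile
            # (upstream additionally replaces zero ranges to avoid
            # division by zero in ``transform``)
        return self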
cross_val_score (defined in sklearn.model_selection._validation):
def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,
                    n_jobs=1, verbose=0, fit_params=None,
                    pre_dispatch='2*n_jobs'):
    """Evaluate a score by cross-validation.

    Read more in the :ref:`User Guide `.
    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.

    X : array-like
        The data to fit. Can be for example a list, or an array.

    y : array-like, optional, default: None
        The target variable to try to predict in the case of supervised
        learning.

    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.

    scoring : string, callable or None, optional, default: None
        A string (see model evaluation documentation) or a scorer callable
        object / function with signature ``scorer(estimator, X, y)``.
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

        - None, to use the default 3-fold cross validation,
        - integer, to specify the number of folds in a `(Stratified)KFold`,
        - An object to be used as a cross-validation generator.
        - An iterable yielding train, test splits.

        For integer/None inputs, if the estimator is a classifier and ``y``
        is either binary or multiclass, :class:`StratifiedKFold` is used.
        In all other cases, :class:`KFold` is used. (A sketch of the
        accepted forms follows this parameter list.)

        Refer :ref:`User Guide ` for the various cross-validation
        strategies that can be used here.
    n_jobs : integer, optional
        The number of CPUs to use to do the computation. -1 means
        'all CPUs'.

    verbose : integer, optional
        The verbosity level.

    fit_params : dict, optional
        Parameters to pass to the fit method of the estimator.

    pre_dispatch : int, or string, optional
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:

        - None, in which case all the jobs are immediately created and
          spawned. Use this for lightweight and fast-running jobs, to
          avoid delays due to on-demand spawning of the jobs
        - An int, giving the exact number of total jobs that are spawned
        - A string, giving an expression as a function of n_jobs, as in
          '2*n_jobs'
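    # Editor's sketch of the accepted ``cv`` forms and the parallel
    # options (all names below are standard scikit-learn API):
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_diabetes(return_X_y=True)
    cross_val_score(Lasso(), X, y, cv=5)                  # int: 5 folds
    cross_val_score(Lasso(), X, y, cv=KFold(n_splits=3))  # CV generator
    # Parallel evaluation: one job per fold, dispatching at most
    # 2 * n_jobs fold-fits at a time to bound memory use.
    cross_val_score(Lasso(), X, y, cv=5, n_jobs=-1, pre_dispatch='2*n_jobs')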
    Returns
    -------
    scores : array of float, shape=(len(list(cv)),)
        Array of scores of the estimator for each run of the cross
        validation.
    Examples
    --------
    >>> from sklearn import datasets, linear_model
    >>> from sklearn.model_selection import cross_val_score
    >>> diabetes = datasets.load_diabetes()
    >>> X = diabetes.data[:150]
    >>> y = diabetes.target[:150]
    >>> lasso = linear_model.Lasso()
    >>> print(cross_val_score(lasso, X, y))  # doctest: +ELLIPSIS
    [ 0.33150734  0.08022311  0.03531764]
    See Also
    --------
    :func:`sklearn.model_selection.cross_validate`:
        To run cross-validation on multiple metrics and also to return
        train scores, fit times and score times.

    :func:`sklearn.metrics.make_scorer`:
        Make a scorer from a performance metric or loss function.
""" # To ensure multimetric format is not supported scorer = check_scoring(estimator, scoring=scoring) cv_results = cross_validate(estimator=estimator, X=X, y=y, groups=groups, scoring={'score':scorer}, cv=cv, return_train_score=False, n_jobs=n_jobs, verbose=verbose, fit_params=fit_params, pre_dispatch=pre_dispatch) return cv_results['test_score']