Problem importing a dataset with sklearn
Source: 3-8 Supervised learning: multiple linear regression
weixin_慕哥3021856
2023-05-29 22:15:00
Related code:
data = datasets.load_boston()
Problem description:
As of scikit-learn 1.2, the Boston housing prices dataset can no longer be loaded with load_boston.
Related screenshot:
ImportError                               Traceback (most recent call last)
Cell In[35], line 1
----> 1 data = datasets.load_boston()

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\datasets\__init__.py:156, in __getattr__(name)
    105 if name == "load_boston":
    106     msg = textwrap.dedent(
    107         """
    108         `load_boston` has been removed from scikit-learn since version 1.2.
   (...)
    154         """
    155     )
--> 156     raise ImportError(msg)
    157 try:
    158     return globals()[name]

ImportError:
`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>
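What I have tried:
Following the method suggested in the error message, I can load housing-price data from another source, but then I run into a problem with the code below. What should I do?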
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Dataset
data = fetch_openml(name="house_prices", as_frame=True, parser='auto')
x = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['SalePrice'])

# Split the dataset: train, test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Model training
lr = LinearRegression()
lr.fit(x_train, y_train)
The model training step raises an error.
Error message:
ValueError                                Traceback (most recent call last)
Cell In[28], line 2
      1 lr = LinearRegression()
----> 2 lr.fit(x_train,y_train)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py:648, in LinearRegression.fit(self, X, y, sample_weight)
    644 n_jobs_ = self.n_jobs
    646 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 648 X, y = self._validate_data(
    649     X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    650 )
    652 sample_weight = _check_sample_weight(
    653     sample_weight, X, dtype=X.dtype, only_non_negative=True
    654 )
    656 X, y, X_offset, y_offset, X_scale = _preprocess_data(
    657     X,
    658     y,
   (...)
    661     sample_weight=sample_weight,
    662 )

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\base.py:584, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    582         y = check_array(y, input_name="y", **check_y_params)
    583     else:
--> 584         X, y = check_X_y(X, y, **check_params)
    585     out = X, y
    587 if not no_val_X and check_params.get("ensure_2d", True):

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py:1106, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1101     estimator_name = _check_estimator_name(estimator)
   1102     raise ValueError(
   1103         f"{estimator_name} requires y to be passed, but the target y is None"
   1104     )
-> 1106 X = check_array(
   1107     X,
   1108     accept_sparse=accept_sparse,
   1109     accept_large_sparse=accept_large_sparse,
   1110     dtype=dtype,
   1111     order=order,
   1112     copy=copy,
   1113     force_all_finite=force_all_finite,
   1114     ensure_2d=ensure_2d,
   1115     allow_nd=allow_nd,
   1116     ensure_min_samples=ensure_min_samples,
   1117     ensure_min_features=ensure_min_features,
   1118     estimator=estimator,
   1119     input_name="X",
   1120 )
   1122 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
   1124 check_consistent_length(X, y)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    877         array = xp.astype(array, dtype, copy=False)
    878     else:
--> 879         array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
    880 except ComplexWarning as complex_warning:
    881     raise ValueError(
    882         "Complex data not supported\n{}\n".format(array)
    883     ) from complex_warning

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\_array_api.py:185, in _asarray_with_order(array, dtype, order, copy, xp)
    182 xp, _ = get_namespace(array)
    183 if xp.__name__ in {"numpy", "numpy.array_api"}:
    184     # Use NumPy API to support order
--> 185     array = numpy.asarray(array, order=order, dtype=dtype)
    186     return xp.asarray(array, copy=copy)
    187 else:

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py:1998, in NDFrame.__array__(self, dtype)
   1996 def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
   1997     values = self._values
-> 1998     arr = np.asarray(values, dtype=dtype)
   1999     if (
   2000         astype_is_view(values.dtype, arr.dtype)
   2001         and using_copy_on_write()
   2002         and self._mgr.is_single_block
   2003     ):
   2004         # Check if both conversions can be done without a copy
   2005         if astype_is_view(self.dtypes.iloc[0], values.dtype) and astype_is_view(
   2006             values.dtype, arr.dtype
   2007         ):

ValueError: could not convert string to float: 'RL'
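The last line of the traceback says that a string value, 'RL', reached the point where the feature matrix is cast to float. One way to see which columns carry such values is to list the non-numeric columns of the feature frame. The sketch below uses a small made-up frame rather than the downloaded data; the column names only imitate house_prices, and the assumption that 'RL' comes from a column like MSZoning is mine, not something the traceback states.

```python
import pandas as pd

# Made-up rows for illustration only; the column names imitate the
# house_prices data but the values are not taken from it.
x = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "Street": ["Pave", "Pave", "Grvl"],
    "LotArea": [8000, 9500, 11000],
})

# Columns whose dtype is not numeric (object, string, category, ...)
# are the ones that fail when the frame is converted to a float array.
non_numeric_columns = x.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```

Running the same select_dtypes call on the real feature frame should show how many columns need encoding before LinearRegression can use them.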
1 answer
好帮手慕小猿
2023-05-30
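Hello! Downgrading scikit-learn to version 1.1 makes load_boston available again, which resolves the original import problem. As for the new error, it says the data type is wrong: the string-valued columns must be converted to numeric values in a preprocessing step before training, and the course has not covered that yet.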
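For readers who want to see what such a preprocessing step might look like, here is a minimal sketch that one-hot encodes the text columns and leaves the numeric ones unchanged before fitting LinearRegression. It is one possible approach, not the course's method. The rows are made up for illustration, the column names only imitate house_prices, and the variable names are my own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only (not taken from the real dataset).
x = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV", "RM", "RL"],
    "LotArea": [8000, 9500, 11000, 7000, 12000, 10000],
    "OverallQual": [7, 6, 7, 5, 8, 6],
})
y = pd.Series([200000, 180000, 230000, 150000, 260000, 190000], name="SalePrice")

# One-hot encode every non-numeric column; numeric columns pass through.
preprocess = ColumnTransformer(
    transformers=[
        (
            "categorical",
            OneHotEncoder(handle_unknown="ignore"),
            make_column_selector(dtype_exclude="number"),
        ),
    ],
    remainder="passthrough",
)

# Chain preprocessing and the regressor so both are fitted together.
model = make_pipeline(preprocess, LinearRegression())
model.fit(x, y)
predictions = model.predict(x)
```

On the real house_prices data this sketch alone is probably not enough, because that dataset also contains missing values; an imputation step such as SimpleImputer would likely have to be added to the pipeline.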
Happy learning!