Question about file encoding when a crawler writes to a file
Source: 1-1 This Week's Introduction
__________千
2021-12-03 00:43:09

# coding:utf-8
import os
import requests
import re
import json


def work(content, job):
    # Each match is the raw JSON text of one job posting.
    reobj1 = re.compile(r'"tags":(.*?)\,"jt":"0_0",')
    all_item = reobj1.findall(content)
    job_name = re.compile('"job_name":"(.*?)"')
    company_name = re.compile('"company_name":"(.*?)"')
    providesalary_text = re.compile('"providesalary_text":"(.*?)"')
    attribute_text = re.compile(r'"attribute_text":\[(.*?)\]\,"')
    jobwelf = re.compile('"jobwelf":"(.*?)"')
    companysize_text = re.compile('"companysize_text":"(.*?)"')
    companyind_text = re.compile('"companyind_text":"(.*?)"')
    for item in all_item:
        all_attribute_text = attribute_text.findall(item)[0].split(',')
        # Only keep postings that list all four attributes.
        if len(all_attribute_text) == 4:
            dict = {
                '职位名称': job_name.findall(item)[0],
                '公司名称': company_name.findall(item)[0],
                '薪资': providesalary_text.findall(item)[0].replace('\\', ''),
                '工作地点': eval(all_attribute_text[0]),
                '工作经验': eval(all_attribute_text[1]),
                '学历': eval(all_attribute_text[2]),
                '招收人数': eval(all_attribute_text[3]),
                '待遇': jobwelf.findall(item)[0],
                '公司规模': companysize_text.findall(item)[0],
                '工作主要内容': companyind_text.findall(item)[0].replace('\\', '')
            }
            print(dict)
            dict = json.dumps(dict)
            # Append one JSON record per line; create the file on first write.
            if os.path.exists('{}51job职位.txt'.format(job)):
                with open("{}51job职位.txt".format(job), 'a+', encoding='utf-8') as f:
                    f.write(dict + '\n')
            else:
                with open("{}51job职位.txt".format(job), 'w', encoding='utf-8') as f:
                    f.write(dict + '\n')


def main():
    job = input('请输入你查找的工作:')
    url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,{},2,{}.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare='
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34'
    }
    # Fetch result pages 1 through 9.
    for i in range(1, 10):
        response = requests.get(url=url.format(job, i), headers=headers)
        work(response.text, job)


if __name__ == '__main__':
    main()

Teacher, why is the text written to my file not in Chinese?
1 answer
好帮手慕凡
2021-12-03
Hello!
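1. In Python 3, strings (str) are Unicode by default.
2. By default, json.dumps() in Python 3 escapes every non-ASCII character as a \uXXXX sequence, where XXXX is the character's code in hexadecimal. json.loads() performs the reverse operation and converts those sequences back into the original characters. That is why the string returned by json.dumps() contains \uXXXX instead of Chinese text: the data is intact, it is only escaped.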
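The default behavior described above can be checked with a small sketch. The sample record here is hypothetical and only stands in for one of the job dictionaries from the question:

```python
import json

# Hypothetical sample record, standing in for one scraped job posting.
record = {"职位名称": "工程师"}

# Default call: ensure_ascii is True, so non-ASCII characters are escaped.
escaped = json.dumps(record)
print(escaped)  # the Chinese characters appear as \uXXXX sequences

# json.loads() decodes the escapes back into the original characters.
restored = json.loads(escaped)
print(restored == record)
```

So a file full of \uXXXX sequences is still valid JSON and can be loaded back without loss; it is just hard to read in a text editor.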
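Solution:
json.dumps() has an ensure_ascii parameter. When it is True (the default), all non-ASCII characters are output as \uXXXX sequences. Set ensure_ascii to False when calling dumps, and the Chinese text stored in the JSON displays normally (the original answer showed this in a screenshot).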
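As a minimal sketch of the fix, the example below writes one record with ensure_ascii=False and reads it back. The sample record, the file name jobs.txt, and the use of a temporary directory are assumptions made for illustration, not part of the original code:

```python
import json
import os
import tempfile

# Hypothetical sample record, standing in for one scraped job posting.
record = {"职位名称": "工程师", "公司名称": "示例公司"}

# ensure_ascii=False keeps non-ASCII characters as-is instead of escaping them.
line = json.dumps(record, ensure_ascii=False)

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "jobs.txt")
with open(path, "a", encoding="utf-8") as f:  # mode "a" creates the file if missing
    f.write(line + "\n")

# Read the line back to confirm that readable Chinese text was stored.
with open(path, encoding="utf-8") as f:
    saved = f.readline().rstrip("\n")

print(saved)
```

In the code from the question, the corresponding change would be json.dumps(dict, ensure_ascii=False). The file also has to be opened with encoding='utf-8', which that code already does.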

Happy learning!