Question about file encoding when a crawler writes to a file
Source: 1-1 This Week's Introduction
__________千
2021-12-03 00:43:09
# coding:utf-8
import os
import requests
import re
import json


def work(content, job):
    # Pull each job posting out of the JSON embedded in the page
    reobj1 = re.compile('"tags":(.*?)\,"jt":"0_0",')
    all_item = reobj1.findall(content)
    job_name = re.compile('"job_name":"(.*?)"')
    company_name = re.compile('"company_name":"(.*?)"')
    providesalary_text = re.compile('"providesalary_text":"(.*?)"')
    attribute_text = re.compile('"attribute_text":\[(.*?)\]\,"')
    jobwelf = re.compile('"jobwelf":"(.*?)"')
    companysize_text = re.compile('"companysize_text":"(.*?)"')
    companyind_text = re.compile('"companyind_text":"(.*?)"')
    for item in all_item:
        all_attribute_text = attribute_text.findall(item)[0].split(',')
        if len(all_attribute_text) == 4:
            dict = {
                '职位名称': job_name.findall(item)[0],
                '公司名称': company_name.findall(item)[0],
                '薪资': providesalary_text.findall(item)[0].replace('\\', ''),
                '工作地点': eval(all_attribute_text[0]),
                '工作经验': eval(all_attribute_text[1]),
                '学历': eval(all_attribute_text[2]),
                '招收人数': eval(all_attribute_text[3]),
                '待遇': jobwelf.findall(item)[0],
                '公司规模': companysize_text.findall(item)[0],
                '工作主要内容': companyind_text.findall(item)[0].replace('\\', '')
            }
            print(dict)
            dict = json.dumps(dict)
            # Append one JSON record per line to the output file
            if os.path.exists('{}51job职位.txt'.format(job)):
                with open("{}51job职位.txt".format(job), 'a+', encoding='utf-8') as f:
                    f.write(dict + '\n')
            else:
                with open("{}51job职位.txt".format(job), 'w', encoding='utf-8') as f:
                    f.write(dict + '\n')


def main():
    job = input('请输入你查找的工作:')
    url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,{},2,{}.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare='
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34'
    }
    # Fetch result pages 1 through 9
    for i in range(1, 10):
        response = requests.get(url=url.format(job, i), headers=headers)
        work(response.text, job)


if __name__ == '__main__':
    main()
Teacher, why is the text written to my file not in Chinese?
1 answer
好帮手慕凡
2021-12-03
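Hello!
1. In Python 3, strings (str) are Unicode text by default.
2. In Python 3, json.dumps() by default converts Chinese characters into Unicode escape sequences written in hexadecimal, and the reverse operation, json.loads(), converts those escapes back into Chinese. That is why the string you get from json.dumps() contains \uXXXX sequences instead of Chinese characters.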
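To make the point above concrete, here is a minimal sketch of the round trip. The `record` dictionary is a made-up sample, not data from the crawler in the question.

```python
import json

# Hypothetical sample record; the real crawler builds a larger dict.
record = {"职位名称": "工程师"}

# ensure_ascii defaults to True, so every non-ASCII character
# is written as a \uXXXX escape sequence.
escaped = json.dumps(record)
print(escaped)

# json.loads() reverses the escaping and gives back the original text.
restored = json.loads(escaped)
print(restored == record)
```

So the data in the file is not corrupted: it is still valid JSON and loads back to the same Chinese text. It is just not readable when you open the file in an editor.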
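Solution:
json.dumps() has an ensure_ascii parameter. When it is True (the default), all non-ASCII characters are output as \uXXXX sequences. Set ensure_ascii=False when dumping, and the Chinese text stored in the JSON will display normally.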
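A minimal sketch of the fix, assuming a made-up `record` and a temporary file path in place of the real output file from the question:

```python
import json
import os
import tempfile

# Hypothetical sample record standing in for one scraped job posting.
record = {"职位名称": "工程师", "薪资": "1-2万/月"}

# ensure_ascii=False keeps the Chinese characters as they are.
line = json.dumps(record, ensure_ascii=False)

# Hypothetical output path; the question writes to '{}51job职位.txt'.
path = os.path.join(tempfile.mkdtemp(), "jobs.txt")

# The file must be opened with a Unicode-capable encoding such as UTF-8.
with open(path, "a", encoding="utf-8") as f:
    f.write(line + "\n")

# Read the line back to confirm it is stored as readable Chinese.
with open(path, encoding="utf-8") as f:
    saved = f.readline().rstrip("\n")
print(saved)
```

In the code from the question, this amounts to changing `json.dumps(dict)` to `json.dumps(dict, ensure_ascii=False)`. Note also that mode 'a' creates the file when it does not exist yet, so the separate os.path.exists() branch is not strictly needed.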
Happy learning~