Python Spider to Crawl PangziTV
Use Python to crawl TV shows, movies, and photos…
Web spider
This project is an exercise in learning how to crawl information with Python.
PangziTV top movies download
- Finish downloading the images, links, and movie list
- Create a Markdown file that shows the movie list and links
- Save the movie images
- Try to download the movies themselves (finished)
  - See PangziTV spider version 2
Installation instructions for Mac
- Put `sudo python3 -m` before the pip commands so that the libraries are installed for Python 3 instead of the Mac's default Python 2:

    sudo python3 -m pip install requests
    sudo python3 -m pip install beautifulsoup4
    sudo python3 -m pip install lxml

Attention:
- Some files need read and write permissions set for the specific user!
- Some bugs are caused by tab and space indentation issues in VS Code.
- Copy the Python file into your own directory (such as the Hexo blog directory) and adapt it before running.
Don’t forget to fake headers
    headers = { ... }
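As a minimal sketch of what faking headers can look like: the header values below are hypothetical placeholders, not the ones used in the original script. Copy the real values from your own browser's developer tools. The sketch attaches them to a `requests.Session` so every request made through that session carries them.

```python
import requests

# Hypothetical header values: replace them with the ones your own
# browser sends; the site may require different or additional fields.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/87.0.4280.88 Safari/537.36"
    ),
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

# A Session sends these headers with every request made through it.
session = requests.Session()
session.headers.update(headers)
```

No request is sent by this snippet; it only prepares the session.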
Main code
Use requests to fetch the URL for analysis
    url = 'https://xxxxxxxxxxx'
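One possible way to fetch the page is sketched here. The helper name `fetch_html` and its parameters are assumptions made for illustration, not code from the original script.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Return the decoded HTML of `url`, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    # Use the encoding detected from the body, which helps with
    # Chinese pages that are served without a charset header.
    response.encoding = response.apparent_encoding
    return response.text
```

A call would look like `html = fetch_html(url, headers=headers)`, with `url` and `headers` defined as in the post.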
Change location
    # create a new dir to store images
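A small sketch of creating the image directory follows. The function name `prepare_image_dir` and the default folder name `images` are assumptions. The original script may instead change the working directory with `os.chdir`; returning a path, as done here, avoids changing global state.

```python
import os


def prepare_image_dir(base_dir, name="images"):
    """Create `base_dir/name` if needed and return its absolute path."""
    image_dir = os.path.abspath(os.path.join(base_dir, name))
    os.makedirs(image_dir, exist_ok=True)  # no error if it already exists
    return image_dir
```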
Main methods
Get specific info by analyzing the HTML code of the website
    def test_print(soup): ...
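To illustrate how BeautifulSoup can pull titles, links, and image URLs out of a page, here is a sketch that runs on invented markup. The HTML structure, the class name `movie-list`, and the function `extract_movies` are all hypothetical; the real site's tags will differ and need to be checked in the browser's inspector. The sketch uses the built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real site's tags and class names will differ.
SAMPLE_HTML = """
<ul class="movie-list">
  <li><a href="/movie/1.html" title="Movie A"><img src="/img/a.jpg"></a></li>
  <li><a href="/movie/2.html" title="Movie B"><img src="/img/b.jpg"></a></li>
</ul>
"""


def extract_movies(soup):
    """Return a list of (title, page_link, image_url) tuples."""
    movies = []
    for anchor in soup.select("ul.movie-list li a"):
        image = anchor.find("img")
        movies.append((
            anchor.get("title", ""),
            anchor.get("href", ""),
            image.get("src", "") if image else "",
        ))
    return movies


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
movies = extract_movies(soup)
```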
Save text file method (can be modified for the Hexo blog)
    def save_info_txt(temp_index, item_name, location_url, movie_link): ...
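The body of the save method is not shown, so the following is only a guess at the idea: append one Markdown entry per movie. The function name `append_movie_entry`, the `path` parameter, and the entry layout are assumptions, as is treating `location_url` as the image location.

```python
def append_movie_entry(path, index, item_name, image_url, movie_link):
    """Append one movie as a Markdown entry to the file at `path`."""
    entry = (
        f"{index}: [{item_name}]({movie_link})\n\n"
        f"![{item_name}]({image_url})\n\n"
    )
    # "a" appends, so repeated calls build up the list; utf-8 keeps
    # Chinese titles intact.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry
```

Calling it once per crawled movie produces a file that a Markdown-based blog such as Hexo can render as a numbered list of linked titles with cover images.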
Download images method
    # downloader method
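A downloader along these lines could stream each image to disk. The name `download_image` and its parameters are assumptions, not the original method.

```python
import os

import requests


def download_image(image_url, dest_dir, file_name, headers=None):
    """Stream `image_url` to `dest_dir/file_name` and return the saved path."""
    dest_path = os.path.join(dest_dir, file_name)
    with requests.get(image_url, headers=headers, stream=True, timeout=10) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as handle:  # "b" = binary mode
            # Write the body in chunks instead of loading it all at once.
            for chunk in response.iter_content(chunk_size=8192):
                handle.write(chunk)
    return dest_path
```

Streaming in chunks keeps memory use flat, which matters little for cover images but is a useful habit for larger files.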
Download movies methods
- Working out the download method…
  - Get the encoded m3u8 link
  - Decode it with base64 decoding and `urllib.parse.unquote(str)`
- Practice more with regular expressions
  - Use `split` and a regex to get a valid file name and URL
- Write non-text files with mode `ab`, where `b` means binary
- Debug the storage path so files are saved in the correct place
- Download all `*.ts` files
- macOS: use `ffmpeg` to merge all `*.ts` files into a `.mp4` file
  - Install `ffmpeg` with `brew install ffmpeg`
- Delete all `.ts` files after the merge is done
- Fix some bugs found at run time
- Add a progress bar function
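The decoding and parsing steps above can be sketched with the standard library alone. The function names are hypothetical, and the exact encoding used by the site is an assumption: the sketch base64-decodes first and then undoes percent-encoding, following the order the list gives.

```python
import base64
import re
from urllib.parse import unquote, urljoin


def decode_m3u8_link(encoded):
    """Base64-decode `encoded`, then undo URL percent-encoding."""
    raw = base64.b64decode(encoded).decode("utf-8")
    return unquote(raw)


def safe_filename(title):
    """Replace characters that are unsafe in file names with underscores."""
    return re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")


def segment_urls(playlist_text, playlist_url):
    """Return absolute URLs for every .ts line in an m3u8 playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with "#" are playlist tags, not segments.
        if line and not line.startswith("#") and line.split("?")[0].endswith(".ts"):
            urls.append(urljoin(playlist_url, line))
    return urls
```

`urljoin` resolves relative segment names against the playlist's own URL, so the same function works whether the playlist lists relative or absolute paths.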
Import all the required Python packages
    import requests
Code to extract and download the '.ts' files, merge them, and delete them
    # implement download m3u8 files
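Since only the first line of this code survives, here is a hedged sketch of the download, merge, and delete cycle. All function names, the numbered segment file names, and the list file `ts_list.txt` are assumptions. The merge uses ffmpeg's concat demuxer, which is one common way to join `.ts` segments and may not be the command the original script runs. Each segment is written with `wb`, whereas the post mentions `ab` (binary append).

```python
import os
import subprocess
import sys

import requests


def print_progress(done, total, width=30):
    """Draw a one-line text progress bar such as [###---] 3/6."""
    filled = width * done // total if total else width
    bar = "#" * filled + "-" * (width - filled)
    sys.stdout.write(f"\r[{bar}] {done}/{total}")
    sys.stdout.flush()


def download_segments(urls, dest_dir, headers=None):
    """Download each segment to dest_dir as 00001.ts, 00002.ts, ..."""
    paths = []
    for index, url in enumerate(urls, start=1):
        path = os.path.join(dest_dir, f"{index:05d}.ts")
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        with open(path, "wb") as handle:  # binary write, one file per segment
            handle.write(response.content)
        paths.append(path)
        print_progress(index, len(urls))
    return paths


def write_concat_list(ts_paths, list_path):
    """Write the file list that ffmpeg's concat demuxer reads."""
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in ts_paths:
            handle.write(f"file '{os.path.abspath(path)}'\n")


def merge_and_clean(ts_paths, output_mp4, list_path="ts_list.txt"):
    """Merge the segments into one .mp4 with ffmpeg, then delete them."""
    write_concat_list(ts_paths, list_path)
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_path, "-c", "copy", output_mp4],
        check=True,  # raise if ffmpeg fails, so segments are not deleted
    )
    for path in list(ts_paths) + [list_path]:
        os.remove(path)
```

Using `-c copy` joins the segments without re-encoding, so the merge is fast. Paths that contain a single quote would need escaping in the list file.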
Some advice
- Run the Python file to update the info from the website
- Copy the Markdown file into the blog
- Delete my_info_file afterwards, since it would otherwise be generated as a blog post
- Update my_info_file monthly or daily
Demo: crawling the movie list, links, and images of one page
1: 延禧攻略HD

2: 如懿传HD

3: 扶摇

4: 知否?知否?应是绿肥红瘦 DVD版

5: 皓镧传 DVD

6: 庆余年 1080P蓝光

7: 那年花开月正圆

8: 倚天屠龙记

9: 都挺好

10: 封神演义DVD 1080P

11: 将夜1 DVD

12: 琅琊榜之风起长林

13: 盗墓笔记少年篇之沙海

14: 小欢喜

15: 宸汐缘

16: 陈情令

17: 古董局中局 DVD版

18: 白发/白发皇妃/白发王妃

19: 小女花不弃 DVD

20: 一起同过窗 第二季

21: 三生三世十里桃花

22: 全职高手

23: 九州海上牧云记

24: 黄金瞳

25: 烈火如歌

26: 亲爱的,热爱的

27: 大江大河 DVD版

28: 东宫

29: 盗墓笔记2之怒海潜沙&秦岭神树

30: 河神1

https://bestbonbai.me/2021/01/02/Python-Spider-to-Crawl-PangziTV/







