[Web] exchange rate crawler
<Note: crawling with the C/shell script no longer works; I have since re-crawled it with Python — click here>
I recently set up a server and played around with Python crawling tools for a bit — I think this is an area well worth digging into.
My dad once told me that his company needs a file of bank exchange rates. A colleague had written a Java executable for this, but it stopped working after the web page changed its format. Nobody ever rewrote the little tool, so the job has been done by hand ever since.
http://rate.bot.com.tw/Pages/Static/UIP003.zh-TW.htm -> this is the page
Since I happened to be playing with Scrapy at the time, I used it to crawl the page. It looks roughly like this:
- Item:
```python
import scrapy

class RateItem(scrapy.Item):
    currency = scrapy.Field()
    value = scrapy.Field()
```
- spider:
```python
import scrapy
from rate.items import RateItem

class RateSpider(scrapy.Spider):
    name = "rate_spider"
    allowed_domains = ["bot.com.tw"]
    start_urls = ['http://rate.bot.com.tw/Pages/Static/UIP003.zh-TW.htm']

    def parse(self, response):
        item = RateItem()
        all_data = response.xpath('//*[@id="slice1"]')
        item['currency'] = all_data.css('td.titleLeft::text').extract()
        item['value'] = all_data.css('td.decimal::text').extract()
        return item
```
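The two CSS selectors above just pull the text out of `td.titleLeft` cells (currency names) and `td.decimal` cells (the rates). The same idea can be sketched with the standard library's `html.parser` alone, no Scrapy needed. The class names follow the Bank of Taiwan page; the sample row itself is a made-up toy, not real data:

```python
from html.parser import HTMLParser

class TdTextParser(HTMLParser):
    """Collect the text of <td> cells, keyed by their class attribute."""
    def __init__(self):
        super().__init__()
        self.current = None  # class of the <td> we are inside, if any
        self.cells = {'titleLeft': [], 'decimal': []}

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            cls = dict(attrs).get('class', '')
            if cls in self.cells:
                self.current = cls

    def handle_endtag(self, tag):
        if tag == 'td':
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.cells[self.current].append(data.strip())

# toy row shaped like the rate table
sample = ('<tr><td class="titleLeft">USD</td>'
          '<td class="decimal">31.1</td><td class="decimal">31.7</td></tr>')
parser = TdTextParser()
parser.feed(sample)
print(parser.cells)
```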
- pipeline:
```python
class RatePipeline(object):
    def process_item(self, item, spider):
        # one row per currency; each currency has four rate columns
        with open('rate.txt', 'w', encoding='utf-8') as file:
            file.write('%20s %10s %10s %10s %10s\n'
                       % ('currency', 'cash_in', 'cash_out', 'sight_in', 'sight_out'))
            i = 0
            for country in item['currency']:
                file.write('{0:->25}'.format(country))
                for j in range(4):
                    file.write('%10s ' % item['value'][i + j])
                file.write('\n')
                i += 4
        return item
```
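The pipeline just lines the two scraped lists up into fixed-width columns, four rate values per currency. A minimal sketch of that formatting step on fake data (the numbers below are made up, not real rates):

```python
# fake scraped output, in the same shape the spider returns:
# a flat list of values, four per currency
item = {
    'currency': ['USD', 'JPY'],
    'value': ['31.1', '31.7', '31.4', '31.5', '0.27', '0.29', '0.28', '0.285'],
}

lines = []
i = 0
for country in item['currency']:
    # '-'-pad the name to 25 characters, then append its four columns
    row = '{0:->25}'.format(country)
    row += ''.join('%10s ' % v for v in item['value'][i:i + 4])
    lines.append(row)
    i += 4

print('\n'.join(lines))
```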
The crawl worked, but I had no idea whether the machines at my dad's company have Python and the crawling tools installed. Since the page is fairly simple anyway, why not just crawl it with a shell script and a bit of C?
- script:
```bash
#!/bin/bash
URL="http://rate.bot.com.tw/Pages/Static/UIP003.zh-TW.htm"
COLS=10   # how many columns do you want to fetch
C_FILE="fetch.c"

if [ $# -eq 1 ]; then
    case $1 in
        "h" | "help" | "-h" | "--help")
            echo "scrapy [ URL ]"
            exit;;
        *)
            URL=$1;;
    esac
fi

wget -O bank_rate.html $URL 2>/dev/null
if [ $? -eq 0 ]; then
    if [ ! -f fetch.exe ]; then
        gcc -o fetch.exe $C_FILE
        if [ ! $? -eq 0 ]; then
            echo "compile fetch.c failed"
            exit
        fi
    fi
    grep decimal bank_rate.html | ./fetch.exe $COLS > rate.txt
    cat rate.txt
    rm -f bank_rate.html fetch.exe rate.txt
else
    echo "download file failed"
fi
```
- C source file named fetch.c:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char str[1000];

/* print the cell text that ends at `tail`: scan backwards to the
 * nearest '>' (end of a tag) or ';' (end of an HTML entity), then
 * print everything between that point and `tail` */
int back(char *tail){
    char *head;
    for(head = tail ; ; head--)
        if(*head == ';' || *head == '>')
            break;
    for(head += 1 ; head < tail ; head++)
        printf("%c" , *head);
    printf(" ");
    return 0;
}

int main(int argc , char *argv[]){
    if(argc != 2){
        puts("Wrong argc");
        return -1;
    }
    int i , cnt = atoi(argv[1]); /* cnt is the column count */
    char *tail;
    while(fgets(str , sizeof(str) , stdin)){
        /* pull out up to cnt "...</td>" cells per line; overwrite each
         * consumed "</td>" so the next strstr() finds the next cell */
        for(i = 0 ; i < cnt ; i++){
            if((tail = strstr(str , "</td>"))){
                back(tail);
                strncpy(tail , "ignore" , 6);
            }else{
                break;
            }
        }
        puts("");
    }
    return 0;
}
```
But barely a day later, wget stopped getting anything back — presumably it got blocked — while the browser and curl still worked fine.
After some digging, the block seems to be based on the User-Agent. No problem: add a flag and impersonate someone else.
wget -U Mozilla/5.0 -O bank_rate.html $URL 2>/dev/null (for curl, the flag is -A)
Note: wget saves to a named file with an uppercase -O, while curl uses a lowercase -o!
That disguises us as Firefox. You can impersonate any other browser too, installed or not, because all this does is hand the server a fake name!
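The same User-Agent trick works from Python's standard library, in case wget and curl are unavailable too. This sketch only builds the request and shows the header that would be sent; no network call is made (uncomment the last line to actually download):

```python
import urllib.request

URL = 'http://rate.bot.com.tw/Pages/Static/UIP003.zh-TW.htm'

# pretend to be a browser; this string is all the server gets to judge us by
req = urllib.request.Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
print(req.get_header('User-agent'))

# html = urllib.request.urlopen(req).read()
```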