[Web] exchange rate crawler

<Note: crawling with the C/shell script no longer works; I've already re-crawled it with Python — click here>



I recently set up a server and tried out Python's crawling tools. I think this is an area well worth digging into.



My dad once told me that his company needs a file of bank exchange rates. A colleague had written a Java executable for it, but it stopped working when the web page changed its format, and since nobody ever rewrote the little tool, the job had been done by hand ever since.



http://rate.bot.com.tw/Pages/Static/UIP003.zh-TW.htm -> this is the page



Since I happened to be playing with Scrapy at the time, I used it to crawl the page. It looks roughly like this:




  • Item (rate_item.py):


import scrapy

class RateItem(scrapy.Item):
    currency = scrapy.Field()
    value = scrapy.Field()



  • Spider (rate_spider.py):


import scrapy
from rate.items import RateItem

class RateSpider(scrapy.Spider):
    name = "rate_spider"
    allowed_domains = ["bot.com.tw"]
    start_urls = ['http://rate.bot.com.tw/Pages/Static/UIP003.zh-TW.htm']

    def parse(self, response):
        item = RateItem()
        all_data = response.xpath('//*[@id="slice1"]')
        item['currency'] = all_data.css('td.titleLeft::text').extract()
        item['value'] = all_data.css('td.decimal::text').extract()
        return item
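The two selectors return flat, parallel lists: one currency name per table row, and four numbers per currency (cash buy/sell and sight buy/sell). A minimal sketch of how those lists line up, using made-up sample values:

```python
# Hypothetical sample of what the spider's item could contain:
currencies = ['USD', 'JPY']
values = ['30.1', '30.7', '30.4', '30.5',    # USD: cash in/out, sight in/out
          '0.26', '0.27', '0.265', '0.268']  # JPY: same four columns

# Pair each currency with its four rates, four values at a time
rows = {c: values[i * 4:(i + 1) * 4] for i, c in enumerate(currencies)}
```

This four-at-a-time grouping is exactly what the pipeline below does with its `i += 4` counter.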



  • Pipeline (rate_pipline.py):


class RatePipeline(object):
    def process_item(self, item, spider):
        with open('rate.txt', 'w+') as f:
            f.write('%20s %10s %10s %10s %10s\n' % ('currency', 'cash_in', 'cash_out', 'sight_in', 'sight_out'))
            i = 0
            for country in item['currency']:
                f.write('{0:->25}'.format(country.encode('utf-8')))
                for j in range(4):
                    f.write('%10s ' % item['value'][i + j])
                f.write('\n')
                i += 4
        return item
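For the pipeline to run at all, it has to be registered in the project's settings. A sketch, assuming the Scrapy project is named rate and the pipeline class lives in rate/pipelines.py (both names are assumptions):

```python
# settings.py (sketch): enable RatePipeline for this project.
# The dotted path and the priority (0-1000, lower runs first) are assumptions.
ITEM_PIPELINES = {
    'rate.pipelines.RatePipeline': 300,
}
```

The spider is then started with `scrapy crawl rate_spider`, and the pipeline writes out rate.txt.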


So the crawl worked, but I didn't know whether Dad's company had Python and Scrapy installed. Since the page is fairly simple, I figured: why not just crawl it with a shell script and some C?




  • Script (rate_scrapy.sh):


#!/bin/bash
URL="http://rate.bot.com.tw/Pages/Static/UIP003.zh-TW.htm"
COLS=10        # how many columns to fetch
C_FILE="fetch.c"

if [ $# -eq 1 ]; then
    case $1 in
        "h" | "help" | "-h" | "--help")
            echo "scrapy [ URL ]"
            exit;;
        *)
            URL=$1;;
    esac
fi

wget -O bank_rate.html $URL 2>/dev/null
if [ $? -eq 0 ]; then
    if [ ! -f fetch.exe ]; then
        gcc -o fetch.exe $C_FILE
        if [ $? -ne 0 ]; then
            echo "compile fetch.c failed"
            exit 1
        fi
    fi
    grep decimal bank_rate.html | ./fetch.exe $COLS > rate.txt
    cat rate.txt
    rm -f bank_rate.html fetch.exe rate.txt
else
    echo "download file failed"
fi



  • C source file named fetch.c:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char str[1000];

/* print the text between the previous ';' or '>' and tail */
int back(char *tail){
    char *head;
    for(head = tail; ; head--)
        if(*head == ';' || *head == '>')
            break;
    for(head += 1; head < tail; head++)
        printf("%c", *head);
    printf(" ");
    return 0;
}

int main(int argc, char *argv[]){
    if(argc != 2){
        puts("Wrong argc");
        return -1;
    }
    int i, cnt = atoi(argv[1]);  /* cnt is the column count per row */
    char *tail;
    while(fgets(str, sizeof(str), stdin)){
        for(i = 0; i < cnt; i++){
            if((tail = strstr(str, "</td>")) == NULL)
                break;
            back(tail);
            strncpy(tail, "ignore", 6);  /* overwrite the match so strstr finds the next one */
        }
        puts("");
    }
    return 0;
}
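For readers who don't want to trace the C, the filter's core loop (find each `</td>`, scan backwards to the previous `>` or `;`, and emit what sits in between) can be sketched in Python. The input line here is made up:

```python
# Made-up input: one row of the rate table after `grep decimal`
line = '<td class="decimal">30.5</td><td class="decimal">0.26</td>'

out = []
pos = 0
while True:
    tail = line.find('</td>', pos)         # like strstr(str, "</td>")
    if tail == -1:
        break
    head = tail
    while line[head] not in ('>', ';'):    # back(): scan to previous '>' or ';'
        head -= 1
    out.append(line[head + 1:tail])        # the cell's text content
    pos = tail + len('</td>')              # like overwriting the match with "ignore"
```

For this sample line, out ends up holding the two rate strings.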


But barely a day later, wget couldn't fetch anything anymore; presumably it was blocked, since the browser and curl still worked fine.



After some digging, it looks like they block by User-Agent. No problem: just add a flag and impersonate someone else.



wget -U Mozilla/5.0 -O bank_rate.html $URL 2>/dev/null    (with curl, use -A)



Note: wget saves to a file with uppercase -O, while curl uses lowercase -o!



And with that we're dressed up as Firefox. You can impersonate any other browser too, whether or not it's actually installed, because it's just a fake name handed to the server!
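The same trick works from Python if wget and curl aren't an option. A sketch using only the standard library; the shortened UA string is just an example:

```python
import urllib.request

URL = 'http://rate.bot.com.tw/Pages/Static/UIP003.zh-TW.htm'

# Send a browser-like User-Agent instead of the default
# ("Python-urllib/3.x"), which a server may block just like wget's.
req = urllib.request.Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
# html = urllib.request.urlopen(req).read()  # the actual fetch (needs network)
```

The header travels with the request object, so any opener that sends req will present the fake name.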


