<tr class="row" onmouseover="this.className='row1'" onmouseout="this.className='row'">
<td height="20"><a href="/gc/123.html" target="_blank">皇马</a></td>
<td align="center">皇马</td>
<td align="center">大陆</td>
<td align="center"><a href="/gc/123.html" target="_blank">点击进入</a></td>
<td align="center">12月26日</td>
</tr>
如上述html所示 网页中存在大量的<tr class="row" onmouseover="this.className='row1'" onmouseout="this.className='row'">
所以我想先用函数 findAll找到上述tr标签 之后再去获取a标签中的链接/gc/123.html
该如何做呢?
采用BeautifulSoup 可以这样做
import urllib
import sys
import re
from BeautifulSoup import BeautifulSoup
fp = open("文档",'r')
for eachurl in fp:
urlhandle = urllib.urlopen(eachurl)
content = urlhandle.read()
parser = BeautifulSoup(content)
res = parser.findAll('tr',{'onmouseout':'this.className=\'row\'','onmouseover':'this.className=\'row1\''})
for my in res:
state = []
for a in my.a['href']:
if a != None:
state.append(a)
print ''.join(state)
fp.close()
è½ä¸è½ç¨BeautifulSoupè¿ä¸ªæ件å¢ï¼
追çlxml å é¨å¸¦æäºbeautifulsoup,建议æ¹lxmlå§