How can I use a regular expression in Python to extract the content between <p> tags in XML?

<p>When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,
,<xref ref-type="bibr" rid="pone.0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells. </p>
<p>(A) R1 cells were cultured for 5 days in the presence of
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[1]</xref> and <italic>nanog</italic>
<xref ref-type="bibr" rid="pone.0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml). </p>
(Note that there are line breaks between <p> and </p> above.) How can I use a regular expression to end up with a list in which each item is one <p>...</p> block, i.e. list=['<p>When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the <xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells. </p>','<p>(A) R1 cells were cultured for 5 days in the presence of <xref ref-type="bibr" rid="pone.0000015-Rogers1">[1]</xref> and <italic>nanog</italic> <xref ref-type="bibr" rid="pone.0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml). </p>']

# Code
import re

html_text = '''
<p>When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the 
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,
,<xref ref-type="bibr" rid="pone.0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells. </p>
<p>(A) R1 cells were cultured for 5 days in the presence of 
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[1]</xref> and <italic>nanog</italic> 
<xref ref-type="bibr" rid="pone.0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml). </p>
'''

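# Non-greedy match so each <p>...</p> pair is captured separately
pattern = r'(<p>.*?</p>)'
# "." does not match newlines by default, so strip the line breaks first
html_text = re.sub('\n', '', html_text)
text = re.findall(pattern, html_text)
print(text)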

# Output
['<p>When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the <xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,,<xref ref-type="bibr" rid="pone.0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells. </p>',
 '<p>(A) R1 cells were cultured for 5 days in the presence of <xref ref-type="bibr" rid="pone.0000015-Rogers1">[1]</xref> and <italic>nanog</italic> <xref ref-type="bibr" rid="pone.0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml). </p>']
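
As a variation on the code above, the line breaks do not have to be removed before matching: the re.DOTALL flag lets "." match newlines too. The sketch below uses a shortened, made-up version of the sample text, and the names xml_text and paragraphs are only illustrative. It also replaces each line break with a single space rather than deleting it, which is an assumption about the spacing you want in the result.

```python
import re

# Shortened stand-in for the sample in the question (not the full text)
xml_text = """
<p>When ES cells differentiate, they migrate out from colonies,
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic>. </p>
<p>(A) R1 cells were cultured for 5 days in the presence of
various doses of LIF. </p>
"""

# re.DOTALL makes "." match newlines, so one <p>...</p> block may span lines;
# ".*?" is non-greedy, so matching stops at the first closing </p>
pattern = re.compile(r"<p>.*?</p>", re.DOTALL)

# Collapse every run of whitespace (including line breaks) into one space
paragraphs = [" ".join(block.split()) for block in pattern.findall(xml_text)]
print(paragraphs)
```

This still assumes the <p> tags carry no attributes and are never nested; for anything less regular, a parser is the safer choice.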

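Answer 1  2015-08-12
I'd suggest parsing the XML directly with Python's BeautifulSoup; no regex matching needed at all! (This answer was accepted by the asker and other users.)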
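
A rough sketch of what that suggestion might look like, assuming the third-party beautifulsoup4 package is installed and using Python's built-in "html.parser" backend. The sample text is a shortened stand-in for the one in the question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Shortened stand-in for the sample in the question (not the full text)
xml_text = """
<p>When ES cells differentiate, they migrate out from colonies,
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic>. </p>
<p>(A) R1 cells were cultured for 5 days in the presence of
various doses of LIF. </p>
"""

soup = BeautifulSoup(xml_text, "html.parser")

# str(tag) serializes the element with its markup; the whitespace
# normalization then turns each line break into a single space
paragraphs = [" ".join(str(p).split()) for p in soup.find_all("p")]
print(paragraphs)
```

If only the text is wanted, without the inner <xref> and <italic> markup, p.get_text() can be used in place of str(p).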
Answer 2  2018-05-10
Wouldn't it be more convenient to just read the XML with one of Python's libraries?
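
One possible reading of this, using only the standard library's xml.etree.ElementTree. Several top-level <p> elements are not a well-formed XML document on their own, so the sketch wraps them in a made-up <root> element first; that wrapper, the shortened sample text, and the variable names are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the sample in the question (not the full text)
xml_text = """
<p>When ES cells differentiate, they migrate out from colonies,
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic>. </p>
<p>(A) R1 cells were cultured for 5 days in the presence of
various doses of LIF. </p>
"""

# Wrap the fragment in a hypothetical <root> so it parses as one document
root = ET.fromstring("<root>" + xml_text + "</root>")

paragraphs = []
for p in root.iter("p"):
    p.tail = None  # drop the text that follows </p>, here just a line break
    markup = ET.tostring(p, encoding="unicode")  # serialize back to a str
    paragraphs.append(" ".join(markup.split()))  # line breaks become spaces
print(paragraphs)
```

Unlike the regex, this fails loudly on malformed input, which may or may not be what you want for scraped files.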