How can I use a regular expression in Python to extract the content between <p> tags in XML?

<p>When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,
,<xref ref-type="bibr" rid="pone.0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells. </p>
<p>(A) R1 cells were cultured for 5 days in the presence of
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[1]</xref> and <italic>nanog</italic>
<xref ref-type="bibr" rid="pone.0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml). </p>
(Note that there are line breaks between <p> and </p> above.) How can I use a regular expression to end up with a list whose items are the content between each <p> and </p>? That is, list=['When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the <xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells. ','(A) R1 cells were cultured for 5 days in the presence of <xref ref-type="bibr" rid="pone.0000015-Rogers1">[1]</xref> and <italic>nanog</italic> <xref ref-type="bibr" rid="pone.0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml). ']

Recommended answer 2018-05-10

# Code
import re

html_text = '''
<p>When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,
,<xref ref-type="bibr" rid="pone.0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells. </p>
<p>(A) R1 cells were cultured for 5 days in the presence of
<xref ref-type="bibr" rid="pone.0000015-Rogers1">[1]</xref> and <italic>nanog</italic>
<xref ref-type="bibr" rid="pone.0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml). </p>
'''

# Non-greedy match of everything between <p> and </p>
pattern = r'<p>(.*?)</p>'
# Remove the line breaks first ('.' does not match '\n' by default);
# the lines are joined without any separator
html_text = re.sub('\n', '', html_text)
text = re.findall(pattern, html_text)
print(text)

# Output
['When ES cells differentiate, they migrate out from colonies on gelatin-coated dishes, similar to the ES cells on the<xref ref-type="bibr" rid="pone.0000015-Rogers1">[17]</xref> and <italic>nanog</italic> ,,<xref ref-type="bibr" rid="pone.0000015-Chambers1">[19]</xref> well-known markers for undifferentiated ES cells. ',
'(A) R1 cells were cultured for 5 days in the presence of<xref ref-type="bibr" rid="pone.0000015-Rogers1">[1]</xref> and <italic>nanog</italic><xref ref-type="bibr" rid="pone.0000015-Mitsui1">[2]</xref>, <xref ref-type="bibr" rid="pone.0000015-Chambers1">[3]</xref> various doses of LIF (0–1,000 units/ml). ']
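
A variant that leaves the line breaks in the input and lets the pattern span them might look like the sketch below. It assumes the paragraphs are wrapped in plain <p>...</p> tags with no attributes and no nesting. The function name extract_paragraphs and the shortened sample string are made up for illustration; they are not part of the question.

```python
import re

def extract_paragraphs(xml_text):
    """Return the inner markup of every <p>...</p> block, with line breaks collapsed."""
    # re.DOTALL makes '.' match newlines, so one paragraph may span several lines.
    raw = re.findall(r'<p>(.*?)</p>', xml_text, flags=re.DOTALL)
    # Collapse each run of whitespace (including newlines) into a single space.
    return [re.sub(r'\s+', ' ', block).strip() for block in raw]

# Hypothetical, shortened sample in the same shape as the question's data.
sample = '''
<p>first line
<italic>nanog</italic> second line</p>
<p>(A) another
paragraph</p>
'''

print(extract_paragraphs(sample))
# ['first line <italic>nanog</italic> second line', '(A) another paragraph']
```

Replacing each line break with a single space, rather than deleting it, keeps words on adjacent lines from running together. The trade-off is that leading and trailing whitespace inside each paragraph is stripped.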

Current URL: http://22.wendadaohang.com/zd/2fh0Xff6fh20ThTIh2.html

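Other answers

Answer 1 2015-08-12

I suggest parsing the XML directly with Python's BeautifulSoup; there is no need for regex matching at all! (This answer was accepted by the asker and by other users.)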
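
A rough sketch of what the BeautifulSoup approach could look like, assuming the third-party beautifulsoup4 package is installed and that the paragraphs sit in <p> tags. The <root> wrapper and the shortened sample are hypothetical, added only to keep the example self-contained.

```python
from bs4 import BeautifulSoup  # third-party package: pip install beautifulsoup4

# Hypothetical, shortened sample in the same shape as the question's data.
sample = '''<root>
<p>first line
<italic>nanog</italic> second line</p>
<p>(A) another
paragraph</p>
</root>'''

# The built-in 'html.parser' backend is used so that no extra parser needs installing.
soup = BeautifulSoup(sample, 'html.parser')

# decode_contents() returns the markup inside each <p>, keeping inline tags such
# as <italic>; split() followed by join() collapses line breaks into single spaces.
paragraphs = [' '.join(p.decode_contents().split()) for p in soup.find_all('p')]
print(paragraphs)
```

If only the text is needed and the inline tags can be dropped, p.get_text() could be used in place of p.decode_contents().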

Answer 2 2018-05-10

Wouldn't it be more convenient to just read the XML with one of Python's libraries?
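
One way to do this with the standard library alone is xml.etree.ElementTree. The following is a minimal sketch, assuming the input is well-formed XML with a single root element. The <root> wrapper, the helper name inner_xml, and the shortened sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; ElementTree needs a single root element.
sample = '''<root>
<p>first line
<italic>nanog</italic> second line</p>
<p>(A) another
paragraph</p>
</root>'''

def inner_xml(elem):
    """Serialize the content of elem (text plus inline tags) without elem's own tag."""
    # ET.tostring(child) also emits the child's tail text, so text that follows
    # an inline tag is not lost.
    parts = [elem.text or ''] + [ET.tostring(child, encoding='unicode') for child in elem]
    # Collapse line breaks and repeated spaces into single spaces.
    return ' '.join(''.join(parts).split())

root = ET.fromstring(sample)
paragraphs = [inner_xml(p) for p in root.iter('p')]
print(paragraphs)
# ['first line <italic>nanog</italic> second line', '(A) another paragraph']
```

Unlike the regex, a parser rejects malformed input with an error instead of silently returning partial matches, which may or may not be what you want for scraped data.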
