Why doesn't a regex work as the argument to xml.findall?

Question

Rating: 0 | Answers: 1 | Published: 12.03.2023

I have an XML file that looks roughly like this:

<?xml version="1.0" encoding="utf-8"?>
<recipes>
<recipe>
<id>44123</id>
<steps>
<step has_minutes="1">in 1 / 4 cup butter , saute carrots , onion , celery and broccoli stems for 5 minutes</step>
<step>add thyme , oregano and basil</step>
<step has_minutes="1">saute 5 minutes more</step>
<step has_minutes="1">simmer 5 minutes</step>
<step has_minutes="1">add cream , simmer 5 minutes more and season to taste</step>
<step>drop in remaining butter , piece by piece , stirring until melted and serve immediately</step>
<step>smoked chicken: on a covered grill , slightly smoke boneless chicken , cooking to medium rare</step>
<step>chef meskan uses applewood chips and does not allow the grill to become too hot</step>
</steps>
</recipe>
<recipe>
<id>67664</id>
<steps>
<step>mix all the ingredients using a blender</step>
<step>pour into popsicle molds</step>
<step>freeze and enjoy !</step>
</steps>
</recipe>
<recipe>
<id>38798</id>
<steps>
<step has_degrees="1">preheat oven to 350 degrees</step>
<step>place on ungreased baking sheet and bake until light brown</step>
</steps>
</recipe>
(and so on)

I'm trying to collect the ids of all recipes in which the steps have minutes or hours marked, using this code:

has_time = []
t = re.compile(r'step has_minutes=[^\n]*"')
for r in xml.find_all('recipe'):
  if r.find_all(t):
    id = r.find('id').text
    has_time.append(id)

print(has_time)

For some reason it prints an empty list. If I replace the regular expression with just step, it finds recipes without any problem. If I pass the literal string step has_minutes="1" instead of the regex, it again finds no recipes at all.

How can I make it find the recipes whose steps have a time marked?

python xml regular-expressions beautiful-soup

Source: Stack Overflow на русском

Answer 1

▲ 1 | Accepted
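
One likely explanation, assuming xml in the question is a BeautifulSoup object (as the beautiful-soup tag and the find_all method suggest): the first argument of find_all filters on the tag name alone. A compiled pattern is therefore tested against the string step, never against the attributes, so a pattern that requires has_minutes=... cannot match. The literal string step has_minutes="1" fails for the same reason, since no tag has that name. In BeautifulSoup, attributes are filtered separately, for example with r.find_all('step', attrs={'has_minutes': True}). The sketch below uses only re to show what the pattern does and does not match; the variables tag_name and raw_markup are illustrative and not part of the original code.

```python
import re

# The pattern from the question.
pattern = re.compile(r'step has_minutes=[^\n]*"')

# What a tag-name filter gets to see: only the element name.
tag_name = "step"
# What the pattern was written for: the raw markup of an opening tag.
raw_markup = '<step has_minutes="1">simmer 5 minutes</step>'

name_match = pattern.search(tag_name)
markup_match = pattern.search(raw_markup)

print(name_match)            # None: the bare name never contains the attribute
print(markup_match.group())  # step has_minutes="1"
```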

Using the standard library to parse the XML file:

import xml.etree.ElementTree as ET

ids = []

# Build the tree and iterate over the <recipe> children of the root.
tree = ET.parse('test.xml')
root = tree.getroot()
for child in root:
    num_id = child.find('id').text
    steps = [value for value in child.findall("steps/step") if value.attrib.get('has_minutes')]

    # Collect the id if at least one step in this recipe has the 'has_minutes' attribute.
    if steps:
        ids.append(num_id)
print(ids)

-----------------------
['44123']
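
A possible variation on the code above, offered as a sketch and not as the answerer's method: ElementTree's limited XPath support includes the [@attrib] predicate, which selects only elements that carry a given attribute, so the filtering can move into the path. The question also mentions hours; the attribute name has_hours, the recipe id 99999, and the helper ids_with_time are assumptions made for illustration, since the sample file shows only has_minutes and has_degrees.

```python
import xml.etree.ElementTree as ET

# Shortened sample modelled on the question's file. The last recipe
# (id 99999, attribute has_hours) is hypothetical.
SAMPLE = """<recipes>
  <recipe>
    <id>44123</id>
    <steps>
      <step has_minutes="1">saute 5 minutes more</step>
      <step>add thyme , oregano and basil</step>
    </steps>
  </recipe>
  <recipe>
    <id>67664</id>
    <steps>
      <step>pour into popsicle molds</step>
    </steps>
  </recipe>
  <recipe>
    <id>38798</id>
    <steps>
      <step has_degrees="1">preheat oven to 350 degrees</step>
    </steps>
  </recipe>
  <recipe>
    <id>99999</id>
    <steps>
      <step has_hours="1">chill 2 hours</step>
    </steps>
  </recipe>
</recipes>"""

# Attribute names treated as "time" markers; has_hours is an assumed name.
TIME_ATTRS = ("has_minutes", "has_hours")


def ids_with_time(root):
    """Return ids of recipes with at least one step carrying a time attribute."""
    found = []
    for recipe in root.findall("recipe"):
        # "[@name]" keeps only <step> elements that have the attribute.
        if any(recipe.find(f"steps/step[@{attr}]") is not None for attr in TIME_ATTRS):
            found.append(recipe.findtext("id"))
    return found


root = ET.fromstring(SAMPLE)
result = ids_with_time(root)
print(result)
```

Unlike attrib.get('has_minutes'), the predicate checks only that the attribute is present, so a step with an empty has_minutes="" value would also count.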
