Sitemap Generator V2 - Python Programming Exercise

In this exercise, you will develop an improved version of a sitemap generator in Python (Sitemap Generator V2). It is a good way to practice web crawling, handling dynamic content, and generating XML in Python. Along the way you will practice handling user input and structuring code for readability and performance, which makes the finished program a useful addition to a programming portfolio. Beginners and experienced programmers alike can use it to deepen their understanding of Python and sharpen their problem-solving skills.

Category

Using additional libraries

Exercise

Sitemap Generator V2

Objective

Develop an improved version of a sitemap generator in Python (Sitemap Generator V2). The program should crawl a website more efficiently, handle dynamic content (such as JavaScript-driven pages), and generate an XML sitemap that includes URLs, last-modified dates, and priority tags. It should support internal and external links and let the user specify the maximum crawl depth. Include error handling for broken links, invalid URLs, and network problems.
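To make the target format concrete, here is a minimal sketch of a sitemap with the three fields named above (URL, last-modified date, priority), built with the standard-library `xml.etree.ElementTree`. The function name `build_sitemap` and the sample entries are assumptions made for illustration; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap string from (url, lastmod, priority) tuples.

    lastmod is a YYYY-MM-DD string or None; priority is a float in [0.0, 1.0].
    (Hypothetical helper, not part of the exercise code.)
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod, priority in entries:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        if lastmod:  # omit the tag when the date is unknown
            ET.SubElement(entry, "lastmod").text = lastmod
        ET.SubElement(entry, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        ("https://example.com/", "2024-01-15", 1.0),
        ("https://example.com/about", None, 0.5),
    ]))
```

Leaving out `lastmod` when the date is unknown keeps the file valid, since the tag is optional in the protocol.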

Python exercise example

import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from urllib.parse import urldefrag, urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Namespace required by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def is_valid_url(url):
    """Check that a URL is reachable (a HEAD request answers with 200)."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=5)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

def get_soup(url):
    """Get the BeautifulSoup object for a given URL, or None on error.

    Only the static HTML is parsed; links injected by JavaScript are not seen.
    """
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return BeautifulSoup(response.text, 'html.parser')
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

def get_page_info(url):
    """Get metadata for the page: (last modified as YYYY-MM-DD or None, priority)."""
    priority = 0.5  # Default priority
    try:
        response = requests.head(url, allow_redirects=True, timeout=5)
        header = response.headers.get('Last-Modified')
        if header:
            # Convert the HTTP date into the date format sitemaps expect
            return parsedate_to_datetime(header).date().isoformat(), priority
    except (requests.exceptions.RequestException, TypeError, ValueError) as e:
        print(f"Error fetching metadata for {url}: {e}")
    return None, priority

def crawl_website(start_url, max_depth, current_depth=0, visited_urls=None):
    """Crawl the website to extract all internal links."""
    if visited_urls is None:
        visited_urls = set()

    # Stop the crawl if the maximum depth is reached
    if current_depth >= max_depth:
        return visited_urls

    soup = get_soup(start_url)
    if not soup:
        return visited_urls

    # Add the current URL to the visited set
    visited_urls.add(start_url)

    # Find all links on the page
    for link in soup.find_all('a', href=True):
        # Resolve relative links and drop any "#fragment" part
        full_url = urldefrag(urljoin(start_url, link['href'])).url

        # Only crawl internal links
        if urlparse(full_url).netloc == urlparse(start_url).netloc:
            if full_url not in visited_urls:
                visited_urls.add(full_url)
                # Recurse into the link; the depth check above ends the recursion
                crawl_website(full_url, max_depth, current_depth + 1, visited_urls)

    return visited_urls

def generate_sitemap(urls, output_file):
    """Generate an XML sitemap and save it to a file."""
    urlset = ET.Element('urlset', xmlns=SITEMAP_NS)

    for url in sorted(urls):
        last_modified, priority = get_page_info(url)
        entry = ET.SubElement(urlset, 'url')
        ET.SubElement(entry, 'loc').text = url
        if last_modified:  # <lastmod> is optional, so skip it when unknown
            ET.SubElement(entry, 'lastmod').text = last_modified
        ET.SubElement(entry, 'priority').text = f"{priority:.1f}"

    ET.ElementTree(urlset).write(output_file, encoding='utf-8', xml_declaration=True)

def main():
    """Main function to initiate the sitemap generation."""
    start_url = input("Enter the starting URL (e.g., https://example.com): ").strip()
    try:
        max_depth = int(input("Enter the maximum depth for crawling: ").strip())
    except ValueError:
        print("The maximum depth must be a whole number.")
        return
    output_file = input("Enter the output file name for the sitemap (e.g., sitemap.xml): ").strip()

    # Reject invalid or unreachable start URLs before crawling
    if not is_valid_url(start_url):
        print(f"Cannot reach {start_url}. Check the URL and your connection.")
        return

    print(f"Crawling website: {start_url} with a maximum depth of {max_depth}...")

    # Start the crawling process
    visited_urls = crawl_website(start_url, max_depth)

    # Generate the sitemap and save it to a file
    if visited_urls:
        print(f"Found {len(visited_urls)} URLs. Generating sitemap...")
        generate_sitemap(visited_urls, output_file)
        print(f"Sitemap saved to {output_file}")
    else:
        print("No URLs were found to include in the sitemap.")

if __name__ == '__main__':
    main()
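The crawler above is recursive and follows only internal links. As a variation, the depth limit can also be enforced with a breadth-first queue, which avoids deep recursion and makes it easy to record external links separately, as the objective asks. This is a sketch under stated assumptions, not the exercise's own method: `crawl_bfs`, `fetch_links`, and the `PAGES` dictionary are hypothetical names, and an in-memory dictionary stands in for real HTTP requests so the example runs without a network. Unlike the listing above, this sketch always records the start URL, even at depth 0.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl_bfs(start_url, max_depth, fetch_links):
    """Breadth-first crawl; returns (internal_urls, external_urls).

    fetch_links(url) is a hypothetical callable returning the raw href
    values found on a page. max_depth counts link hops from start_url.
    """
    site = urlparse(start_url).netloc
    internal, external = {start_url}, set()
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # the page is recorded, but its links are not followed
        for href in fetch_links(url):
            full_url = urldefrag(urljoin(url, href)).url
            parts = urlparse(full_url)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar links
            if parts.netloc != site:
                external.add(full_url)
            elif full_url not in internal:
                internal.add(full_url)
                queue.append((full_url, depth + 1))
    return internal, external


# In-memory stand-in for a real site (illustrative data only).
PAGES = {
    "https://example.com/": ["/a", "/b#top", "https://other.org/x", "mailto:hi@example.com"],
    "https://example.com/a": ["/c", "/"],
    "https://example.com/c": ["/d"],
}

if __name__ == "__main__":
    found_internal, found_external = crawl_bfs(
        "https://example.com/", 2, lambda u: PAGES.get(u, [])
    )
    print(sorted(found_internal))
    print(sorted(found_external))
```

In a real crawler, `fetch_links` would wrap a function such as `get_soup` and return the `href` of each anchor it finds.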

Output

Enter the starting URL (e.g., https://example.com): https://example.com
Enter the maximum depth for crawling: 2
Enter the output file name for the sitemap (e.g., sitemap.xml): sitemap.xml
Crawling website: https://example.com with a maximum depth of 2...
Found 55 URLs. Generating sitemap...
Sitemap saved to sitemap.xml
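
One way to check the result is to read the sitemap back and list the URLs it contains. The following is a minimal sketch, assuming the file follows the sitemaps.org schema; `list_sitemap_urls` and `SAMPLE` are illustrative names and data, not part of the exercise code.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping used in the search path below
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(xml_text):
    """Return the <loc> values found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]


# Small hand-written sitemap used as sample input (illustrative only).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><priority>0.5</priority></url>
  <url><loc>https://example.com/about</loc><priority>0.5</priority></url>
</urlset>"""

if __name__ == "__main__":
    for url in list_sitemap_urls(SAMPLE):
        print(url)
```

For a file on disk, `ET.parse("sitemap.xml").getroot()` can replace the `ET.fromstring` call.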

More Python Programming Exercises in Using additional libraries

Explore our set of Python programming exercises. Designed specifically for beginners, they help you build a solid understanding of Python basics. From variables and data types to control structures and simple functions, each exercise challenges you gradually as you gain confidence coding in Python.

Exploring a directory
In this exercise, you will develop a Python program that explores a specified directory and lists all of its contents, including files and subdirectories. This ...
Exploring subdirectories
In this exercise, you will develop a Python program that explores a specified directory and lists all the subdirectories inside it. This exercise is ...
Working with date and time
In this exercise, you will develop a Python program that works with dates and times. This exercise is perfect for practicing the manipulation of dates and ...
Showing directory contents
In this exercise, you will develop a Python program that shows the contents of a specified directory. This exercise is perfect for practicing ...
Listing executable files in a directory
In this exercise, you will develop a Python program that lists all the executable files in a specified directory. This exercise is perfect for ...
Continuous date and time
In this exercise, you will develop a Python program that continuously shows the current date and time in real time. This exercise is perfect for ...
