WEB SCRAPING: When web scraping, you MUST use a proxy to comply with our terms of service. Direct scraping of third-party websites without the site owner’s permission using Trigger.dev Cloud is prohibited and will result in account suspension. See this example which uses a proxy.
You can view the full code for this project in our examples repository on GitHub, and fork it to use as a starting point for your own project.
This task uses the python.runScript method to run the crawl-url.py script with the given URL as an argument. You can see the original task in our examples repository here.
src/trigger/pythonTasks.ts
```ts
import { logger, schemaTask, task } from "@trigger.dev/sdk";
import { python } from "@trigger.dev/python";
import { z } from "zod";

export const convertUrlToMarkdown = schemaTask({
  id: "convert-url-to-markdown",
  schema: z.object({
    url: z.string().url(),
  }),
  run: async (payload) => {
    // Pass through any proxy environment variables
    const env = {
      PROXY_URL: process.env.PROXY_URL,
      PROXY_USERNAME: process.env.PROXY_USERNAME,
      PROXY_PASSWORD: process.env.PROXY_PASSWORD,
    };

    const result = await python.runScript("./src/python/crawl-url.py", [payload.url], { env });

    logger.debug("convert-url-to-markdown", {
      url: payload.url,
      result,
    });

    return result.stdout;
  },
});
```
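Once the task is deployed (or running locally via the dev command), you can trigger it from your own backend code. The snippet below is a minimal sketch, assuming you use the SDK's tasks.trigger API, your backend is configured with your Trigger.dev secret key, and the URL shown is just a placeholder payload.

```ts
import { tasks } from "@trigger.dev/sdk";
import type { convertUrlToMarkdown } from "./trigger/pythonTasks";

// Queue a run of the task; the handle includes the run ID you can use to check its status.
// The URL below is only an example payload.
const handle = await tasks.trigger<typeof convertUrlToMarkdown>("convert-url-to-markdown", {
  url: "https://example.com",
});
console.log("Triggered run", handle.id);
```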
The Python script uses Crawl4AI to take a URL and return the page's content as Markdown. You can see the original script in our examples repository here.
src/python/crawl-url.py
```python
import asyncio
import sys
import os

from crawl4ai import *
from crawl4ai.async_configs import BrowserConfig


async def main(url: str):
    # Get proxy configuration from environment variables
    proxy_url = os.environ.get("PROXY_URL")
    proxy_username = os.environ.get("PROXY_USERNAME")
    proxy_password = os.environ.get("PROXY_PASSWORD")

    # Configure the proxy
    browser_config = None
    if proxy_url:
        if proxy_username and proxy_password:
            # Use authenticated proxy
            proxy_config = {
                "server": proxy_url,
                "username": proxy_username,
                "password": proxy_password
            }
            browser_config = BrowserConfig(proxy_config=proxy_config)
        else:
            # Use simple proxy
            browser_config = BrowserConfig(proxy=proxy_url)
    else:
        browser_config = BrowserConfig()

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=url,
        )
        print(result.markdown)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python crawl-url.py <url>")
        sys.exit(1)

    url = sys.argv[1]
    asyncio.run(main(url))
```
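The script's only direct third-party dependency is Crawl4AI, so the requirements.txt used in the steps below can be very small. As a sketch (unpinned here; pin the version you actually test against):

```text
crawl4ai
```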
Activate the virtual environment for your OS: on Mac/Linux run source venv/bin/activate, on Windows run venv\Scripts\activate.
Install the Python dependencies: pip install -r requirements.txt
If you haven't already, copy your project ref from your Trigger.dev dashboard and add it to the trigger.config.ts file (a sketch of this config file is shown after these steps).
Run the Trigger.dev CLI dev command, e.g. npx trigger.dev@latest dev (it may ask you to authorize the CLI if you haven't already).
Test the task in the dashboard, using a URL of your choice.
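For reference, here is a sketch of what the trigger.config.ts mentioned above might look like. It assumes you use the Python build extension from @trigger.dev/python to bundle the script and install the requirements; the project ref and file paths are placeholders for your own values.

```ts
import { defineConfig } from "@trigger.dev/sdk";
import { pythonExtension } from "@trigger.dev/python/extension";

export default defineConfig({
  // Replace with the project ref from your Trigger.dev dashboard
  project: "<your-project-ref>",
  build: {
    extensions: [
      pythonExtension({
        // Include the crawl script in the deployment and install its dependencies
        scripts: ["src/python/**/*.py"],
        requirementsFile: "./requirements.txt",
        // Used by the dev command so the task runs against your local virtual environment
        devPythonBinaryPath: "./venv/bin/python",
      }),
    ],
  },
});
```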