The `_` variable refers to the current context object. When accessing properties like `_['url']`, it retrieves values from the input parameters passed to the task.
This step:

- Takes the input URL and crawls the website
- Processes the content into readable markdown format
- Chunks the content into manageable segments
- Filters out unnecessary elements like images and SVGs
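The chunking arithmetic used by the task's `evaluate` expression can be sketched as a standalone Python helper (`chunk_content` is a hypothetical name for illustration; the real task does this inline): the segment size is the larger of `reducing_strength` and one-ninth of the content length, so long documents still collapse into roughly nine chunks.

```python
def chunk_content(content, reducing_strength, parts=9):
    """Split a list of tokens into segments, mirroring the task's
    `chunks` expression: stride = max(reducing_strength, len // parts)."""
    size = max(reducing_strength, len(content) // parts)
    return [" ".join(content[i:i + size]) for i in range(0, len(content), size)]

words = ["w%d" % n for n in range(20)]
# 20 // 9 == 2, so the stride is max(5, 2) == 5 → four chunks of five words
chunks = chunk_content(words, reducing_strength=5)
```

Raising `reducing_strength` produces fewer, larger chunks; the `// 9` floor keeps very long pages from exploding into thousands of tiny segments.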
Step 2: Process and Index Content
```yaml
# Step 0: Evaluate the content into chunks
- evaluate:
    document: $ _.document
    chunks: |
      $ [" ".join(_.content[i:i + max(_.reducing_strength, len(_.content) // 9)]) for i in range(0, len(_.content), max(_.reducing_strength, len(_.content) // 9))]
  label: docs

# Step 1: Process each content chunk in parallel
- over: $ [(steps[0].input.document, chunk.strip()) for chunk in _.chunks]
  parallelism: 3
  map:
    prompt:
    - role: user
      content: >-
        $ f'''
        <document>
        {_[0]}
        </document>

        Here is the chunk we want to situate within the whole document
        <chunk>
        {_[1]}
        </chunk>

        Please give a short succinct context to situate this chunk within the overall document
        for the purposes of improving search retrieval of the chunk.
        Answer only with the succinct context and nothing else.
        '''
    unwrap: true
    settings:
      max_tokens: 16000

# Step 2: Prefix each chunk with its generated context
- evaluate:
    final_chunks: |
      $ [NEWLINE.join([succint, chunk.strip()]) for chunk, succint in zip(steps['docs'].output.chunks, _)]
```
This step:

- Processes each content chunk in parallel
- Generates contextual metadata for improved retrieval
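The final pairing step can be sketched in plain Python (a simplified stand-in for the `final_chunks` evaluate expression; `prepend_contexts` is a hypothetical helper name): each model-generated context is joined onto its chunk with a newline, so the stored document carries its situating context for retrieval.

```python
NEWLINE = "\n"  # the task exposes a NEWLINE constant for use in expressions

def prepend_contexts(chunks, contexts):
    """Prefix each chunk with its generated situating context,
    as the task's final_chunks expression does with NEWLINE.join."""
    return [
        NEWLINE.join([context, chunk.strip()])
        for chunk, context in zip(chunks, contexts)
    ]

final_chunks = prepend_contexts(
    ["Transformers rely on attention. "],
    ["From the architecture section of the article."],
)
```

This is the "contextual retrieval" pattern: the prepended sentence disambiguates the chunk when it is later matched against a search query in isolation.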
```yaml
# yaml-language-server: $schema=https://raw.githubusercontent.com/julep-ai/julep/refs/heads/dev/schemas/create_task_request.json
name: Julep Jina Crawler Task
description: A Julep agent that can crawl a website and store the content in the document store.

################################################################################
# INPUT SCHEMA #################################################################
################################################################################
input_schema:
  type: object
  properties:
    url:
      type: string
    reducing_strength:
      type: integer

################################################################################
# TOOLS ########################################################################
################################################################################
tools:
- name: get_page
  type: api_call
  api_call:
    method: GET
    url: https://r.jina.ai/
    headers:
      accept: application/json
      x-return-format: markdown
      x-with-images-summary: "true"
      x-with-links-summary: "true"
      x-retain-images: "none"
      x-no-cache: "true"
      Authorization: "Bearer JINA_API_KEY"

- name: create_agent_doc
  description: Create an agent doc
  type: system
  system:
    resource: agent
    subresource: doc
    operation: create

################################################################################
# INDEX PAGE SUBWORKFLOW #######################################################
################################################################################
index_page:

# Step #0 - Evaluate the content
- evaluate:
    document: $ _.document
    chunks: |
      $ [" ".join(_.content[i:i + max(_.reducing_strength, len(_.content) // 9)]) for i in range(0, len(_.content), max(_.reducing_strength, len(_.content) // 9))]
  label: docs

# Step #1 - Process each content chunk in parallel
- over: $ "[(steps[0].input.content, chunk) for chunk in _['documents']]"
  parallelism: 3
  map:
    prompt:
    - role: user
      content: >-
        $ f'''
        <document>
        {_[0]}
        </document>

        Here is the chunk we want to situate within the whole document
        <chunk>
        {_[1]}
        </chunk>

        Please give a short succinct context to situate this chunk within the overall document
        for the purposes of improving search retrieval of the chunk.
        Answer only with the succinct context and nothing else.
        '''
    unwrap: true
    settings:
      max_tokens: 16000

# Step #2 - Prefix each chunk with its generated context
- evaluate:
    final_chunks: |
      $ [NEWLINE.join([chunk, succint]) for chunk, succint in zip(steps[1].input.documents, _)]

# Step #3 - Create a new document and add it to the agent docs store
- over: $ _['final_chunks']
  parallelism: 3
  map:
    tool: create_agent_doc
    arguments:
      agent_id: "00000000-0000-0000-0000-000000000000" # <--- The agent id of the agent you want to add the document to
      data:
        metadata:
          source: "jina_crawler"
          title: "Website Document"
        content: $ _

################################################################################
# MAIN WORKFLOW ################################################################
################################################################################
main:

# Step 0: Get the content of the product page
- tool: get_page
  arguments:
    url: $ "https://r.jina.ai/" + steps[0].input.url

# Step 1: Chunk the content
- evaluate:
    result: $ chunk_doc(_.json.data.content.strip())

# Step 2: Index the document chunks via the subworkflow
- workflow: index_page
  arguments:
    content: $ _.result
    document: $ steps[0].output.json.data.content.strip()
    reducing_strength: $ steps[0].input.reducing_strength
```
Start by creating an execution for the task. This execution will make the agent crawl the website and store the content in the document store.
```python
from julep import Client
import time
import yaml

# Initialize the client
client = Client(api_key=JULEP_API_KEY)

# Create the agent
agent = client.agents.create(
    name="Julep Jina Crawler Agent",
    description="A Julep agent that can crawl a website and store the content in the document store.",
)

# Load the task definition
with open('crawling_task.yaml', 'r') as file:
    task_definition = yaml.safe_load(file)

# Create the task
task = client.tasks.create(
    agent_id=agent.id,
    **task_definition
)

# Create the execution
execution = client.executions.create(
    task_id=task.id,
    input={"url": "https://en.wikipedia.org/wiki/Artificial_intelligence", "reducing_strength": 5}
)

# Wait for the execution to complete
while (result := client.executions.get(execution.id)).status not in ['succeeded', 'failed']:
    print(result.status)
    time.sleep(1)

if result.status == "succeeded":
    print(result.output)
else:
    print(f"Error: {result.error}")
```
Next, create a session for the agent. This session will be used to chat with the agent.
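A minimal sketch of that step, assuming the `client` and `agent` objects from the previous snippet are still in scope (the exact chat message shown is illustrative, not from the source):

```python
# Create a session that ties the conversation to the crawler agent
session = client.sessions.create(agent=agent.id)

# Chat with the agent; it can now draw on the indexed website documents
response = client.sessions.chat(
    session_id=session.id,
    messages=[{"role": "user", "content": "Summarize the website you crawled."}],
)
```

The session persists conversation state server-side, so follow-up questions in the same session keep access to the crawled content.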