AI Data Specialist (Contract)
Ventures Platform
About Ventures Platform
Ventures Platform (VP) is a pan-African VC fund that invests early in mission-driven founders building capital-efficient platforms that democratise prosperity, plug infrastructural gaps, connect underrepresented communities, and improve livelihoods in Africa. Currently investing from a $46-million fund, we provide smart capital and growth support for Africa’s boldest entrepreneurs and are early investors in some of the most compelling technology companies on the African continent.
Our Thesis
We invest in companies that solve for non-consumption, plug infrastructural gaps and democratise prosperity in Africa, by eliminating the barriers to access and reducing the costs of delivering goods and services.
Dare. Do. Repeat.
We’ve backed over 140 founders daring to redefine sectors from healthcare to logistics. They don’t just dare. They’re doers. And each day they start over again, driven by their commitment to shape Africa’s future. Notable portfolio companies include Piggyvest, Paystack, SeamlessHR, Reliance Health, Nomba, Verto, and many more.
AI Data Specialist at Ventures Platform
We are seeking a highly motivated and detail-oriented AI Data Specialist to join our team on a contract basis. This is a foundational role focused on the critical first step in our LLM initiative: data extraction, cleaning, and preparation. You will be responsible for navigating diverse data silos within the firm, including email inboxes, cloud storage (Google Drive, etc.), internal databases, and external data sources. Your primary objective will be to extract, transform, and structure this raw data into a high-quality dataset suitable for fine-tuning open-source Large Language Models (LLMs).
This role is ideal for someone who is passionate about data, possesses strong technical skills, and understands the importance of data quality in the success of AI initiatives. You will work closely with our AI Research Lead and will have a direct impact on the firm's ability to leverage AI for strategic advantage. While the initial focus is data preparation, there is potential for growth and involvement in later stages of the LLM fine-tuning process and broader AI projects as the team evolves.
Core Responsibilities
- Data Discovery and Extraction:
  - Identify and map diverse data sources across the firm, including but not limited to:
    - Email inboxes (Gmail, Outlook, etc.) – extracting email threads and attachments (documents, spreadsheets, presentations).
    - Cloud storage platforms (Google Drive, Dropbox, etc.) – identifying and accessing relevant folders and files.
    - Internal databases (CRM, deal management systems, etc.) – querying and exporting relevant data.
    - Document repositories (operational manuals, internal wikis, shared drives).
    - External data sources (market research reports, financial data APIs, news archives, industry publications, stock exchange data feeds, SEC filings, patent databases, alternative data sources).
  - Develop and implement efficient data extraction methods tailored to each data source (an illustrative sketch follows this list). This may involve:
    - Building scripts and tools for automated data extraction (Python, scripting languages).
    - Utilizing APIs and connectors to access data from various platforms.
    - Employing data scraping techniques (where permissible and ethical for publicly available data).
    - Collaborating with internal IT and relevant stakeholders to gain necessary data access and permissions, while adhering to data security and privacy policies.
  - Maintain a comprehensive inventory of data sources and extraction processes.
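To give a flavour of the extraction scripting described above, here is a minimal Python sketch that pulls subjects, plain-text bodies, and attachment names out of locally exported .eml files and writes them to JSONL. The directory layout and field names are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: parse exported .eml files into JSONL records.
# Paths and field names are illustrative assumptions.
import json
from pathlib import Path
from email import policy
from email.parser import BytesParser

EXPORT_DIR = Path("data/raw/email_export")      # assumed local mailbox export
OUT_FILE = Path("data/interim/emails.jsonl")

def parse_eml(path: Path) -> dict:
    """Extract key headers, the best-effort text body, and attachment names."""
    with path.open("rb") as fp:
        msg = BytesParser(policy=policy.default).parse(fp)
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "source_file": str(path),
        "subject": msg["subject"],
        "sender": msg["from"],
        "date": msg["date"],
        "body": body.get_content() if body else "",
        "attachments": [a.get_filename() for a in msg.iter_attachments()],
    }

if __name__ == "__main__":
    OUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with OUT_FILE.open("w", encoding="utf-8") as out:
        for eml in sorted(EXPORT_DIR.glob("**/*.eml")):
            out.write(json.dumps(parse_eml(eml), ensure_ascii=False) + "\n")
```

In practice, extraction from live mailboxes, Drive folders, or internal databases would go through the relevant platform APIs with IT-approved credentials; the sketch only shows the shape of output the downstream cleaning step might expect.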
- Data Cleaning and Preprocessing:
  - Clean and standardize extracted data to ensure consistency and quality (see the sketch after this list). This includes:
    - Text cleaning: Removing noise, special characters, irrelevant formatting, HTML tags, etc.
    - Data deduplication: Identifying and removing redundant data entries.
    - Handling missing data: Developing strategies for dealing with incomplete or missing information (imputation, removal, etc.).
    - Data normalization and standardization: Ensuring consistent data formats and units.
    - Language detection and handling: Identifying and potentially translating text in different languages if required.
  - Implement data validation and quality checks to identify and rectify data errors and inconsistencies.
  - Develop and document data cleaning pipelines and scripts for reproducibility and scalability.
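As a rough illustration of this cleaning stage, the pandas sketch below strips markup, collapses whitespace, drops empty and duplicate records, and prints a simple missing-value report. Column names and file paths are assumptions carried over from the extraction sketch, not a fixed schema.

```python
# Minimal cleaning sketch over extracted records (column names are assumptions).
import re
import html
import pandas as pd

TAG_RE = re.compile(r"<[^>]+>")   # crude tag stripper; adequate for a sketch
WS_RE = re.compile(r"\s+")

def clean_text(raw: object) -> str:
    """Unescape HTML entities, drop tags, and collapse whitespace."""
    if not isinstance(raw, str):
        return ""
    text = html.unescape(raw)
    text = TAG_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()

df = pd.read_json("data/interim/emails.jsonl", lines=True)
df["body_clean"] = df["body"].map(clean_text)

# Drop records with no usable text, then exact duplicates.
df = df[df["body_clean"].str.len() > 0]
df = df.drop_duplicates(subset=["subject", "body_clean"])

# Quick validation report before handing data downstream.
print(df.isna().sum())

df.to_json(
    "data/interim/emails_clean.jsonl",
    orient="records", lines=True, force_ascii=False,
)
```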
- Data Structuring and Preparation for LLM Fine-tuning:
  - Transform unstructured and semi-structured data into structured formats suitable for LLM fine-tuning (see the sketch after this list). This may involve:
    - Document parsing and segmentation: Breaking down large documents into meaningful chunks (paragraphs, sections).
    - Information extraction: Identifying key entities, relationships, and topics within the data.
    - Creating structured datasets: Organizing data into formats like JSON, CSV, or text files with appropriate delimiters.
    - Developing data schemas and ontologies to represent the data in a structured and meaningful way (if needed for advanced structuring).
  - Prepare data for different fine-tuning methodologies. This might include:
    - Creating instruction-following datasets: Structuring data into input-output pairs for instruction tuning.
    - Preparing conversational datasets: Formatting data for dialogue-based fine-tuning.
    - Organizing data for specific tasks: Structuring data for tasks like summarization, question answering, or classification, depending on the firm’s objectives.
  - Implement data augmentation techniques (if applicable and beneficial) to increase the size and diversity of the fine-tuning dataset.
  - Ensure data privacy and security throughout the data preparation process, adhering to internal policies and relevant regulations. De-identification and anonymization techniques may be required for sensitive data.
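For a sense of what the structuring output might look like, the sketch below chunks cleaned text into fixed-size pieces and emits JSONL records in an instruction-style layout. The prompt wording, field names, and chunk size are assumptions; the real schema would follow whichever fine-tuning framework and task the team settles on, with outputs completed during annotation.

```python
# Minimal structuring sketch: chunk cleaned records into an instruction-style
# JSONL dataset. Field names, prompt wording, and chunk size are assumptions.
import json
from pathlib import Path

IN_FILE = Path("data/interim/emails_clean.jsonl")
OUT_FILE = Path("data/processed/finetune_dataset.jsonl")
MAX_CHARS = 2000  # rough chunk size; tune against the target model's context window

def chunk(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole words into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + len(word) + 1 > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        chunks.append(current)
    return chunks

OUT_FILE.parent.mkdir(parents=True, exist_ok=True)
with IN_FILE.open(encoding="utf-8") as src, OUT_FILE.open("w", encoding="utf-8") as out:
    for line in src:
        record = json.loads(line)
        for piece in chunk(record.get("body_clean", "")):
            example = {
                "instruction": "Summarise the following internal note titled "
                               f"'{record.get('subject') or 'untitled'}'.",
                "input": piece,
                "output": "",  # to be completed during annotation/labelling
            }
            out.write(json.dumps(example, ensure_ascii=False) + "\n")
```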
- Collaboration and Documentation:
  - Collaborate effectively with internal stakeholders (e.g., investment professionals, IT, compliance) to understand data needs, access requirements, and data governance policies.
  - Document all data extraction, cleaning, and preparation processes clearly and comprehensively.
  - Contribute to the development of data dictionaries and data lineage documentation.
  - Communicate progress and challenges effectively to the AI Research Lead and relevant stakeholders.
  - Stay up-to-date with the latest advancements in data preparation techniques for LLMs and NLP.
Qualifications:
- Education: Bachelor's or Master's degree in Computer Science, Data Science, Information Management, Natural Language Processing, or a related technical field.
- Experience:
  - 3+ years of experience in data extraction, cleaning, and preparation for machine learning or data analysis projects.
  - Experience working with diverse data sources and formats (emails, documents, databases, APIs, web data).
  - Familiarity with cloud storage platforms (Google Drive, AWS S3, Azure Blob Storage, etc.).
- Technical Skills:
  - Proficiency in Python and relevant data science libraries (Pandas, NumPy, scikit-learn, NLTK/SpaCy).
  - Strong scripting skills for automating data extraction and processing tasks.
  - Experience with database querying languages (SQL, NoSQL).
  - Understanding of data cleaning and preprocessing techniques.
  - Familiarity with data structuring and transformation methods.
  - Basic understanding of Natural Language Processing (NLP) concepts and text data processing.
  - Knowledge of data privacy and security best practices.
  - Experience with version control systems (Git).
- Soft Skills:
  - Excellent analytical and problem-solving skills.
  - Strong attention to detail and commitment to data quality.
  - Excellent communication and collaboration skills.
  - Proactive and self-motivated with the ability to work independently and as part of a team.
  - Ability to learn quickly and adapt to new technologies and data environments.
  - Strong organizational and time management skills.
  - Discretion and confidentiality when handling sensitive business data.
Bonus Points (Nice to Have):
- Experience specifically preparing data for Large Language Models (LLMs) or other NLP models.
- Familiarity with LLM fine-tuning methodologies, such as instruction tuning.
- Experience with data annotation tools and techniques.
- Knowledge of financial markets, venture capital, or investment domains.
- Experience working with cloud-based data processing platforms (AWS, GCP, Azure).
- Contributions to open-source data science projects.
Core Expertise
- A robust network of founders in the region, ideally publicly recognized within the ecosystem and actively engaged in ecosystem events.
- Technical proficiency, including product/tech knowledge, familiarity with AI, and a unique vertical interest.
- Strong writing and analytical skills, with the ability to effectively communicate investment theses and insights.
- A creative and curious mindset, a proactive attitude, great energy, and a strong team spirit.
Core Attributes
Proficiency in:
- Early-stage startup/founder involvement: Prior engagement as a founder or pivotal early team member within a thriving VC or technology startup, with direct participation in operational expansion and specialised skills in scaling ventures.
- Substantial background in early-stage venture capital: Leadership of investment initiatives, particularly in Pre-seed or Seed rounds, with a proven record of successful investments.
- Well-established networks within the regional entrepreneurial ecosystem, spanning the founder, venture capital, and angel investor communities, with a preference for cultivating relationships in the North/West Africa region.
What We Offer:
- The opportunity to be at the forefront of AI innovation within a leading venture capital firm.
- A chance to play a critical role in a high-impact project with significant strategic importance.
- A collaborative and intellectually stimulating work environment.
- Potential for professional growth and development within a rapidly expanding AI team.
Ventures Platform’s Proposition
We are democratising prosperity in Africa through innovation and entrepreneurship: join us as we become Africa’s leading Venture Capital firm.
- An opportunity to partner with the best founders long term to build transformative companies from day 1
- An opportunity to greatly impact the direction and success of Ventures Platform
- Competitive salary combined with incentives and a flexible work structure
- Learning and growth opportunities
If you meet these criteria, we would love to connect with you. Please submit your resume to HR@venturesplatform.com and let’s explore this opportunity together.