Format Your Old LinkedIn Content

This article explains how to make your old LinkedIn content more accessible. If you have posted a lot to LinkedIn, it can be very difficult to locate your old posts; the approach described here provides a potential workaround.

Last month I wrote about how you can access your old content on LinkedIn:

Basically, you use their UI to export it to Comma-Separated Values (CSV) files. That's functional, but not very usable, especially because LinkedIn doesn't encode CSV files correctly. You can have an AI analyze that content:

To format the content, you can run the files through an AI to generate markdown, HTML, JSON, or whatever format you need. Most people will probably like static HTML best, since then you don't need a tool to render markdown; you can just open the file in a browser.
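
If you do end up with markdown and still want a static HTML file, the conversion is easy to do locally. Here's a minimal sketch; it assumes the third-party markdown package is installed (pip install markdown) and that the input file is named LinkedIn_Export_Final.md, which is what the script later in this article writes:

import markdown

# Read the markdown produced by the AI (or by the script below).
with open('LinkedIn_Export_Final.md', encoding='utf-8') as f:
    md_text = f.read()

# Convert to HTML and wrap it in a minimal page so it opens cleanly in a browser.
body = markdown.markdown(md_text)
page = (
    "<!DOCTYPE html>\n<html><head><meta charset='utf-8'></head><body>\n"
    + body
    + "\n</body></html>"
)

with open('LinkedIn_Export_Final.html', 'w', encoding='utf-8') as f:
    f.write(page)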

One caveat is that the export does not include media; the LinkedIn media export feature has been "under construction" for a couple of decades. Another is that the formatting is not great, specifically around linefeeds, but there are probably some instructions I could give to address some of that. And some URLs are LinkedIn Klingon-cloaked so they can ensure your online safety. And by that I mean track you better. And usernames are not clickable. And there are no preview cards for YouTube and such. And you only see your own content, meaning your posts and comments, not the posts you commented on or the comments to which you responded.
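
If you want to de-cloak the shortened links yourself, here is a rough sketch. It assumes the cloaked links are ordinary HTTP redirects (the lnkd.in shortener works that way) and uses the third-party requests package; the example URL is just a placeholder:

import requests

def expand_url(short_url, timeout=10):
    """Follow redirects to recover the real destination of a shortened link."""
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        return resp.url
    except requests.RequestException:
        return short_url  # leave it as-is if the lookup fails

print(expand_url('https://lnkd.in/example'))  # hypothetical shortened link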

Below are the instructions I used to generate the sample that appears on this page.

Claude prevented me from uploading CSV files, so I appended .txt extensions to the file names, which Claude accepted. But by then I had run out of free credit for the day. I had also run out of free credit on ChatGPT. I signed up for a free month of Gemini, but I'm already disappointed, because it is easily confused (AI lacks common sense) and the UI was quirky today. It helped to give it the instructions first and upload the files afterwards. It doesn't let me download markdown files (it links to something like sandbox:/mnt/data/filename.md and blames my UI, which is Google Gemini running in Google's own Chrome browser), and there's a limit to how much it can render inline, which is also very slow.

Anyway, here are the approximate prompts I used:

These CSV files do not seem to be encoded, quoted, or escaped perfectly. Please correct any errors in the files before proceeding.

Shares.csv lists posts. Comments.csv lists comments on posts. Every comment is always associated with a post, but for some comments, there is no post record in the shares CSV. In both files, the URL may identify the post.

Generate a markdown file that I can download. If possible, I want a bulleted list with an entry for each post, including all the comments associated with that post. Each entry should include a link to the post. Use the datestamp as the text of the link. Order the list so that the oldest posts come first.
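
For reference, the structure those prompts describe (and that the script below produces) looks roughly like this, with placeholder dates, links, and text:

* [2021-03-15 09:12:44](https://www.linkedin.com/feed/update/...)
    > Text of the post goes here.
    * **Comments:**
        * 2021-03-15 10:02:11: First comment on the post.
        * 2021-03-17 08:45:03: A later comment.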

Gemini generated the output below. I also asked it to generate code that would let me do this locally; the result, which seems to work, is this:

import pandas as pd
import re
import html
import os

def clean_share_content(raw_text):
    """
    Cleans the messy quoting from the Shares CSV content.
    """
    if not raw_text:
        return ""
    
    # 1. Fix paragraph breaks (often represented as """" or ""\n"")
    text = raw_text.replace('""""', '\n\n')
    
    # 2. Remove the specific double-quote wrappers around lines
    #    (The file seems to wrap lines in "")
    lines = text.split('\n')
    cleaned_lines = []
    for line in lines:
        # Strip leading/trailing quotes and spaces
        line = line.strip()
        if line.startswith('""') and line.endswith('""'):
            line = line[2:-2]
        elif line.startswith('"') and line.endswith('"'):
            line = line[1:-1]
        
        # Fix escaped quotes (csv usually does "" for ")
        line = line.replace('""', '"')
        cleaned_lines.append(line)
    
    text = "\n".join(cleaned_lines)
    
    # 3. HTML Escape (Fixes <title> and other tag issues)
    text = html.escape(text)
    
    return text

def parse_shares_robust(filename):
    """
    Parses Shares.csv line-by-line to avoid CSV parser crashes.
    """
    posts = []
    
    if not os.path.exists(filename):
        print(f"File not found: {filename}")
        return pd.DataFrame()

    with open(filename, 'r', encoding='utf-8', errors='replace') as f:
        lines = f.readlines()

    current_lines = []
    
    # Helper to process a buffered block of lines into a single post
    def process_buffer(buffer):
        if not buffer:
            return
        
        # Line 1 contains Date, Link, and start of content
        # Format: Date,Link,Content...
        first_line = buffer[0]
        parts = first_line.split(',', 2)
        
        if len(parts) < 2:
            return # Skip malformed lines
        
        date = parts[0].strip()
        link = parts[1].strip()
        
        content_parts = []
        
        # Handle content on the first line
        if len(parts) > 2:
            # If this is a single-line entry, strip the trailing CSV fields
            # (SharedUrl, MediaUrl, Visibility) -> usually 3 commas from end
            if len(buffer) == 1:
                content_part = parts[2].rsplit(',', 3)[0]
            else:
                content_part = parts[2] # Take everything, cleanup later
            content_parts.append(content_part)
        
        # Handle middle lines
        for i in range(1, len(buffer) - 1):
            content_parts.append(buffer[i])
            
        # Handle last line (if multi-line)
        if len(buffer) > 1:
            last_line = buffer[-1]
            # Strip trailing CSV fields from the last line
            content_part = last_line.rsplit(',', 3)[0]
            content_parts.append(content_part)
            
        full_raw_content = "\n".join(content_parts)
        clean_content = clean_share_content(full_raw_content)
        
        posts.append({
            'Date': date,
            'Link': link,
            'Content': clean_content
        })

    # Regex to identify the start of a new record (YYYY-MM-DD HH:MM:SS)
    date_pattern = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},')

    # Skip header
    start_idx = 1
    
    for line in lines[start_idx:]:
        if date_pattern.match(line):
            # New post found, process previous buffer
            process_buffer(current_lines)
            current_lines = [line]
        else:
            # Continuation of previous post
            current_lines.append(line)
            
    # Process final buffer
    process_buffer(current_lines)
    
    return pd.DataFrame(posts)

def main():
    print("--- Starting Processing ---")
    
    # 1. Parse Shares (Robustly)
    print("Reading Shares.csv...")
    df_shares = parse_shares_robust('Shares.csv')
    print(f"Parsed {len(df_shares)} posts from Shares.csv")

    # 2. Parse Comments (Standard pandas usually works here, but we escape HTML)
    print("Reading Comments.csv...")
    try:
        df_comments = pd.read_csv('Comments.csv', escapechar='\\')
        # Escape HTML in comments too
        df_comments['Message'] = df_comments['Message'].astype(str).apply(html.escape)
    except Exception as e:
        print(f"Error reading Comments.csv: {e}")
        return

    # 3. Merge Data
    print("Merging datasets...")
    # Identify missing posts
    share_links = set(df_shares['Link'])
    comment_links = set(df_comments['Link'])
    missing_links = comment_links - share_links
    
    print(f"Found {len(missing_links)} posts referenced in comments but missing from shares.")

    missing_posts = []
    for link in missing_links:
        link_comments = df_comments[df_comments['Link'] == link]
        earliest_date = link_comments['Date'].min()
        missing_posts.append({
            'Date': earliest_date,
            'Link': link,
            'Content': "Post content not available."
        })
    
    if missing_posts:
        df_missing = pd.DataFrame(missing_posts)
        all_posts = pd.concat([df_shares, df_missing], ignore_index=True)
    else:
        all_posts = df_shares

    # 4. Sort
    all_posts['DateDT'] = pd.to_datetime(all_posts['Date'], errors='coerce')
    all_posts = all_posts.sort_values('DateDT', ascending=True)

    # 5. Generate Markdown
    print("Generating Markdown file...")
    output_file = 'LinkedIn_Export_Final.md'
    
    comments_by_link = df_comments.groupby('Link')
    
    with open(output_file, 'w', encoding='utf-8') as f:
        for _, row in all_posts.iterrows():
            date_str = str(row['Date'])
            link = row['Link']
            content = row['Content']
            
            # Post Header
            f.write(f"* [{date_str}]({link})\n")
            
            # Post Content
            if content and content.lower() != 'nan':
                for line in content.split('\n'):
                    if line.strip():
                        f.write(f"    > {line}\n")
            
            # Comments
            if link in comments_by_link.groups:
                post_comments = comments_by_link.get_group(link).sort_values('Date')
                f.write("    * **Comments:**\n")
                for _, c_row in post_comments.iterrows():
                    c_date = c_row['Date']
                    c_msg = c_row['Message']
                    # Indent multi-line comments
                    c_msg_indented = c_msg.replace('\n', '\n        ')
                    f.write(f"        * {c_date}: {c_msg_indented}\n")

    print(f"Success! Saved to {os.path.abspath(output_file)}")

if __name__ == "__main__":
    main()
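
To run it locally, save the script next to the Shares.csv and Comments.csv files from your LinkedIn export (pandas must be installed) and run it with Python; it writes LinkedIn_Export_Final.md to the same directory.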

AI Session

This is a link to the relevant AI session:

Comments

You can comment here:

Sample Output

Here's the partial output that Gemini was able to generate from my data.