<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SEER on Dual Brain Lab</title><link>https://csilab.net/en/tags/seer/</link><description>Recent content in SEER on Dual Brain Lab</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 16 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://csilab.net/en/tags/seer/index.xml" rel="self" type="application/rss+xml"/><item><title>50 Years of Medical Databases: From 1973 to the AI Era</title><link>https://csilab.net/en/p/medical-databases-50years/</link><pubDate>Thu, 16 Apr 2026 00:00:00 +0000</pubDate><guid>https://csilab.net/en/p/medical-databases-50years/</guid><description>&lt;div class="video-wrapper" style="position: relative; width: 100%; padding-bottom: 56.25%; margin: 1.5rem 0;">
 &lt;iframe
 src="https://www.youtube.com/embed/Ewf0_Ckc07A"
 title="YouTube video player"
 frameborder="0"
 allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
 allowfullscreen
 style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border-radius: 8px;">
 &lt;/iframe>
&lt;/div>

&lt;p>Every day, thousands of researchers publish papers using public databases. But few know where these databases actually came from.&lt;/p>
&lt;p>Today, I&amp;rsquo;ll walk you through fifty years of history, from 1973 to now. By the end, you&amp;rsquo;ll know exactly where to start.&lt;/p>
&lt;h2 id="1973--where-it-all-begins-seer">1973 — Where it all begins: SEER
&lt;/h2>&lt;p>The story begins in 1973.&lt;/p>
&lt;p>Two years earlier, in 1971, Nixon had signed the National Cancer Act. America declared war on cancer, but you can&amp;rsquo;t fight a war without knowing where the enemy is. So the NCI built &lt;strong>SEER&lt;/strong>, the first national cancer registry. From that point on, diagnosis, staging, treatment, and survival were recorded for every cancer patient in the areas it covers.&lt;/p>
&lt;p>Fifty years later, SEER covers half the US population and has directly produced over 17,000 papers. To this day, it remains the number one data source for cancer epidemiology.&lt;/p>
&lt;h2 id="1989-1999--seeds-being-planted">1989-1999 — Seeds being planted
&lt;/h2>&lt;p>Over the next twenty years, more seeds were planted.&lt;/p>
&lt;p>&lt;strong>NHANES&lt;/strong> (started 1971, became a continuous biennial national health survey by 1999) — the king of cross-sectional studies, cited over 60,000 times on PubMed.&lt;/p>
&lt;p>&lt;strong>CHNS&lt;/strong> (1989) — China Health and Nutrition Survey, jointly launched by UNC Chapel Hill and the Chinese CDC. It tracked health and nutrition changes across China, and was one of the first Chinese health databases open to the world.&lt;/p>
&lt;p>&lt;strong>PhysioNet&lt;/strong> (1999) — MIT launched a platform dedicated to sharing clinical data. MIMIC, eICU, PIC — all the databases you&amp;rsquo;ve heard of, they&amp;rsquo;re all hosted there. PhysioNet isn&amp;rsquo;t a database. &lt;strong>It&amp;rsquo;s infrastructure.&lt;/strong>&lt;/p>
&lt;h2 id="2000-2010--three-forces-driving-the-explosion">2000-2010 — Three forces driving the explosion
&lt;/h2>&lt;p>After 2000, things accelerated. Three forces pushed at the same time.&lt;/p>
&lt;p>&lt;strong>First, legislation.&lt;/strong> ClinicalTrials.gov went online in 2000, and in 2007 the FDA Amendments Act made registration a legal requirement for most trials of FDA-regulated drugs and devices. It now has half a million trials, and AACT turned all that data into a queryable database.&lt;/p>
&lt;p>&lt;strong>Second, journal mandates.&lt;/strong> Nature said: you want to publish? First deposit your gene expression data in GEO. Data submission became a prerequisite for publication. GEO now has 200,000 studies and 6.5 million samples.&lt;/p>
&lt;p>&lt;strong>Third, technology breakthroughs.&lt;/strong> High-throughput sequencing made data explode. In 2006, TCGA launched, profiling 33 cancer types with multi-omics data. Over 20,000 samples, 2.5 petabytes, mentioned in over 29,000 PubMed papers. It pushed cancer classification beyond histology alone and toward molecular subtypes.&lt;/p>
&lt;p>During this period, CTD, CCLE (the precursor to DepMap), and the EPA air quality system all came online.&lt;/p>
&lt;h2 id="2015--the-imagenet-moment-for-clinical-ai">2015 — The ImageNet moment for clinical AI
&lt;/h2>&lt;p>In 2015, something game-changing happened.&lt;/p>
&lt;p>MIT released &lt;strong>MIMIC-III&lt;/strong>. 60,000 ICU admissions, fully de-identified, freely available worldwide.&lt;/p>
&lt;p>This database has been cited nearly 8,000 times. It&amp;rsquo;s been called &lt;strong>the ImageNet of clinical AI&lt;/strong> — just as ImageNet sparked the boom in computer vision, MIMIC sparked an explosion in clinical prediction models.&lt;/p>
&lt;p>ICU mortality prediction, early sepsis warning, mechanical ventilation management. Nearly every critical care AI study started with MIMIC.&lt;/p>
&lt;p>Then in 2018, Philips released &lt;strong>eICU&lt;/strong>, 200,000 multi-center ICU records. With eICU, MIMIC models could finally be externally validated.&lt;/p>
&lt;h2 id="2019--china-enters">2019 — China enters
&lt;/h2>&lt;p>In 2019, China started building its own databases.&lt;/p>
&lt;p>Children&amp;rsquo;s Hospital of Zhejiang University released &lt;strong>PIC&lt;/strong>, the world&amp;rsquo;s first publicly available pediatric ICU database. Over 13,000 admissions, 12,000 patients, hosted on PhysioNet.&lt;/p>
&lt;p>This meant China went from being a &lt;strong>user&lt;/strong> of public databases to a &lt;strong>builder&lt;/strong>.&lt;/p>
&lt;p>Around the same time, the Broad Institute launched &lt;strong>DepMap&lt;/strong>. It uses CRISPR for genome-wide screening, telling you which genes each cancer truly depends on. From describing mutations to functional validation. Another paradigm shift.&lt;/p>
&lt;h2 id="2021-to-now--designed-for-ai">2021 to now — Designed for AI
&lt;/h2>&lt;p>By 2021, the direction shifted again.&lt;/p>
&lt;p>&lt;strong>MIMIC-IV&lt;/strong> was released — no longer just ICU data. It added 200,000 emergency department records. From an ICU database to a complete acute care database.&lt;/p>
&lt;p>&lt;strong>CarpeDiem&lt;/strong> opened in 2023, with 44 clinical parameters per day, simulating what you see on daily rounds.&lt;/p>
&lt;p>&lt;strong>NWICU&lt;/strong> opened in 2024, 12 hospitals, 28,000 ICU stays, aligned with MIMIC-IV&amp;rsquo;s data structure.&lt;/p>
&lt;p>Notice the pattern? These new databases aren&amp;rsquo;t designed for humans to browse. &lt;strong>They&amp;rsquo;re designed for machine learning.&lt;/strong> Standardized structures, cross-database validation built in.&lt;/p>
&lt;p>At the same time, imaging and signal data exploded: &lt;strong>CheXpert&lt;/strong> with 200,000 chest X-rays, &lt;strong>PTB-XL&lt;/strong> with 20,000 ECGs, &lt;strong>Kvasir&lt;/strong> with endoscopy images, &lt;strong>CAMELYON&lt;/strong> with whole-slide pathology. Almost every imaging direction now has a public dataset.&lt;/p>
&lt;h2 id="this-path-ive-already-walked-it">This path, I&amp;rsquo;ve already walked it
&lt;/h2>&lt;p>So after all this history, what does it mean for you? Everything.&lt;/p>
&lt;p>Fifty years ago, doing research required your own data. No lab, no cohort, you couldn&amp;rsquo;t do anything.&lt;/p>
&lt;p>Today is different. Dozens of public databases are freely available. From critical care to oncology, from epidemiology to genomics, from ECGs to pathology slides.&lt;/p>
&lt;p>And databases can be &lt;strong>combined&lt;/strong>. MIMIC for modeling plus eICU for external validation. SEER clinical data plus TCGA molecular data. NHANES for the primary analysis plus CHNS for cross-population validation. These combination strategies are the real secret to publishing with public databases.&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Your direction&lt;/th>
 &lt;th>Recommended combo&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Beginners&lt;/td>
 &lt;td>NHANES or SEER&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Critical care&lt;/td>
 &lt;td>MIMIC + eICU&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Oncology&lt;/td>
 &lt;td>SEER + TCGA + GEO&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
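&lt;p>All of these databases are free. The ICU databases hosted on PhysioNet, such as MIMIC, eICU, and PIC, just require signing up there and agreeing to the data use terms. Others, such as NHANES and GEO, can be downloaded directly from their own sites.&lt;/p>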
&lt;p>This path, I&amp;rsquo;ve already walked it.&lt;/p></description></item></channel></rss>