Dix Hospital Ledger data cleaning and basic descriptives

This datalab notebook documents the cleaning of the Dix Hospital intake ledger (Raleigh, North Carolina) and basic descriptive statistics on the cleaned data.
The code was written in Stata/MP 16.

By: Nabarun Dasgupta (nab@unc.edu)



Import

In [1]:
display "Notebook generated on $S_DATE at $S_TIME ET"
Notebook generated on 24 Feb 2020 at 15:05:33 ET
In [2]:
cd "/Users/nabarun/Dropbox/Projects/Dix Park Intake/"
use DixLedgerDeidentified_clean, clear
qui: describe, f
/Users/nabarun/Dropbox/Projects/Dix Park Intake



Variable Construction

In [3]:
// Space for exploratory variable creation
* gen VAR = regexm(lower(TEXT),"token|token")
* table year if war==1, c(sum war) col

Univariate Exploration

Dates of Admission and Discharge

In [4]:
graph dot (sum) counter, over(decade) vertical title("Number of Patients Admitted") ytitle("Number of Admissions by Decade") graphregion(color(white)) bgcolor(white) scale(1.4)
mdesc decade
tab decade




    Variable    |     Missing          Total     Percent Missing
----------------+-----------------------------------------------
         decade |          31          7,479           0.41
----------------+-----------------------------------------------


  Decade of |
  admission |      Freq.     Percent        Cum.
------------+-----------------------------------
      1850s |        340        4.56        4.56
      1860s |        541        7.26       11.83
      1870s |        430        5.77       17.60
      1880s |        762       10.23       27.83
      1890s |      1,274       17.11       44.94
      1900s |      1,708       22.93       67.87
      1910s |      2,393       32.13      100.00
------------+-----------------------------------
      Total |      7,448      100.00

Admissions dipped in the 1870s (430, down from 541 in the 1860s), then began to climb in the 1880s and rose in every subsequent decade, from 762 in the 1880s to 2,393 in the 1910s.
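
The decade-to-decade growth can be put in percentage terms. The following is a minimal sketch, not part of the original analysis: it assumes `decade` sorts chronologically (as it does in the tabulation above), and the names `admissions` and `pct_change` are introduced here purely for illustration.

```stata
* Sketch: percent change in admissions from one decade to the next.
* Assumes the cleaned ledger is already in memory; `admissions' and
* `pct_change' are hypothetical variable names used only in this example.
preserve
    * Collapse to one row per decade, with the row count stored in `admissions'
    contract decade if !missing(decade), freq(admissions)
    sort decade
    * Change relative to the previous decade (missing for the first decade)
    gen pct_change = 100 * (admissions - admissions[_n-1]) / admissions[_n-1]
    format pct_change %5.1f
    list decade admissions pct_change, noobs clean
restore
```

Using the counts tabulated above, the step from the 1870s to the 1880s works out to (762 - 430) / 430, or roughly a 77% increase. The `preserve`/`restore` pair keeps the full ledger in memory for the cells that follow.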


In [5]:
graph dot (sum) counter, over(dayofweek) vertical title("Day of Week of Admission") ytitle("Number of Admissions") graphregion(color(white)) bgcolor(white) scale(1.4)
mdesc dayofweek




    Variable    |     Missing          Total     Percent Missing
----------------+-----------------------------------------------
      dayofweek |          31          7,479           0.41
----------------+-----------------------------------------------
In [6]:
tab admitmonth pellagra
  Month of |       pellagra
 admission |         0          1 |     Total
-----------+----------------------+----------
         J |       575         13 |       588 
         F |       517          7 |       524 
         M |       580          7 |       587 
         A |       830         17 |       847 
         M |       660         17 |       677 
         J |       598         11 |       609 
         J |       571         20 |       591 
         A |       608         14 |       622 
         S |       608         14 |       622 
         O |       524         10 |       534 
         N |       608         21 |       629 
         D |       609          9 |       618 
-----------+----------------------+----------
     Total |     7,288        160 |     7,448 

Per the day-of-week dot plot from In [5], admissions peaked on Tuesdays and were lowest on Sundays.
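
Because the day-of-week result is shown only as a plot, the ranking can be confirmed with a frequency table. This is a minimal sketch that assumes `dayofweek` is the same variable used in the dot plot; it is not a cell from the original notebook.

```stata
* Sketch: admissions by day of week, listed from most to least frequent.
* The 31 records with a missing day of week are excluded by default.
tab dayofweek, sort
```

The `sort` option orders the rows by descending frequency, so the first and last rows give the busiest and quietest days directly.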



In [7]:
graph dot (sum) counter, over(admitmonth) vertical title("Month of Admission") ytitle("Number of Admissions") graphregion(color(white)) bgcolor(white) scale(1.4)
mdesc admitmonth




    Variable    |     Missing          Total     Percent Missing
----------------+-----------------------------------------------
     admitmonth |          31          7,479           0.41
----------------+-----------------------------------------------

April had the most admissions (847 of the 7,448 with a recorded month) and February the fewest (524).



Age

Histogram of age distribution at time of admission

In [8]:
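* Age at intake histogram (missing-age count and percent computed for the note)
qui count if missing(age)
local pct : display %3.1f 100*r(N)/_N
hist age, width(5) freq graphregion(color(white)) bgcolor(white) note("Caution: missing age in `r(N)' (`pct'%) of patients")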
(bin=18, start=0, width=5)

In [9]:
bysort decade: summ age
-----------------------------------------------------------------------
-> decade = 1850s

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        301    35.15282    11.13178         17         67

-----------------------------------------------------------------------
-> decade = 1860s

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        476    35.83403    12.96375         13         85

-----------------------------------------------------------------------
-> decade = 1870s

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        427    34.55738    12.25773         12         78

-----------------------------------------------------------------------
-> decade = 1880s

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |        751    38.11917    13.27175          8         81

-----------------------------------------------------------------------
-> decade = 1890s

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,252    38.49281    13.85457          7         83

-----------------------------------------------------------------------
-> decade = 1900s

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,662     39.4284    14.26474          8         84

-----------------------------------------------------------------------
-> decade = 1910s

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      2,359    39.25011    15.50522          0         90

-----------------------------------------------------------------------
-> decade = .

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |          3    51.66667    25.14624         23         70

In [10]:
graph dot (mean) age, over(decade) vertical title("Mean Age at Admission") ytitle("Age in Years") graphregion(color(white)) bgcolor(white) scale(1.4)