Finding links in Hebrew text

Jewish texts are interconnected. Nearly every Jewish text after the Bible is packed with references to other texts, both on-page and off-page. In a digital world, these links can be made more explicit and more informative. Links between texts open up opportunities for richer and broader study, and for analysis of the whole network of textual interconnections.

People have been writing Jewish indices for generations, but as it turns out, much of what used to take scholars with encyclopedic memories years to complete can now be done by computers in minutes.

Last Thursday, a new script I wrote scanned Sefaria’s entire library of Hebrew texts looking for links between texts, and added another 32,000 links to our database.

We’re looking to find links in our library of texts wherever we can. We started with our English texts, where links are relatively easy to pick out. They usually look something like this: Proverbs 10:12 or like this: Shabbat 15b. The patterns are pretty clear. We know what texts we have in the system, and we can look for a text name followed by a number or a series of numbers.
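To make the English pattern concrete, here is a minimal sketch of that kind of matcher. It is an illustration, not Sefaria’s actual code: the short `TITLES` list, the `ENGLISH_REF` pattern, and the `find_english_refs` helper are all assumptions made up for this example.

```python
import re

# Hypothetical sample of titles; a real system would load its full list of texts.
TITLES = ["Proverbs", "Shabbat", "Genesis"]

# Longest titles first, so a longer title is never shadowed by a shorter one.
TITLE_ALTERNATION = "|".join(
    re.escape(t) for t in sorted(TITLES, key=len, reverse=True)
)

# A title, then a chapter or folio ("10" or "15b"), then an optional ":verse".
ENGLISH_REF = re.compile(
    r"\b(?P<title>" + TITLE_ALTERNATION + r")\s+"
    r"(?P<section>\d+[ab]?)"
    r"(?::(?P<segment>\d+))?"
)


def find_english_refs(text):
    """Return every substring of text that looks like 'Title 10:12' or 'Title 15b'."""
    return [m.group(0) for m in ENGLISH_REF.finditer(text)]


print(find_english_refs("See Proverbs 10:12 and Shabbat 15b."))
```

Building the pattern from the list of known titles is what keeps this simple: only names that are actually in the library can start a match.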

In Hebrew, the puzzle gets more complicated, for two reasons. First, Hebrew text names are often simple words in the Hebrew language. The book of Exodus is named שמות, which literally means “names”; Deuteronomy is called דברים which means “words” or “things”; tractates of Talmud are named after the topics that they discuss. Second, Arabic numerals (0 through 9) are not generally used to designate numbers in Jewish texts. Hebrew letters serve not only as phonetic letters, but also as numerals. In looking for links in Hebrew text, we can’t rely on just looking for the name of a book followed by a number or two.

Perhaps because it is so easy to confuse Hebrew words and numbers, printers and online texts have usually set apart references by putting them in parentheses. For our first round of scans, we’re counting on this, catching just the references that are found within parentheses or braces. We also distinguish between those series of Hebrew letters that could possibly be numbers, and those that couldn’t be. Some Hebrew letters stand for single-digit numbers, and some stand for tens or hundreds. If the letters aren’t in the correct order, we know it’s not a number. (By the way, when Sefaria translates back and forth between Arabic numerals and Hebrew numbers, it’s relying on code from J.J., one of the volunteer contributors to our open-source code.)
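The letter-order test can be sketched roughly as follows. This is not the contributed conversion code mentioned above; `hebrew_to_int` and its helper tables are hypothetical names for this example. It assumes the standard letter values (א through ט for 1 to 9, י through צ for 10 to 90, ק through ת for 100 to 400) and the usual convention of writing 15 and 16 as 9+6 and 9+7.

```python
# Standard values of the 22 Hebrew letters (final forms are deliberately left out).
LETTERS = "אבגדהוזחטיכלמנסעפצקרשת"
LETTER_VALUES = dict(zip(
    LETTERS,
    list(range(1, 10)) + list(range(10, 100, 10)) + list(range(100, 500, 100)),
))

# Quote marks and geresh/gershayim that often decorate Hebrew numerals.
PUNCTUATION = "'\"\u05f3\u05f4"


def hebrew_to_int(token):
    """Return the number a run of Hebrew letters encodes, or None if it cannot be one."""
    letters = [c for c in token if c not in PUNCTUATION]
    if not letters or any(c not in LETTER_VALUES for c in letters):
        return None
    values = [LETTER_VALUES[c] for c in letters]

    # 15 and 16 are written 9+6 and 9+7; treat them as 10+5 and 10+6 for the checks.
    if values[-2:] == [9, 6]:
        values[-2:] = [10, 5]
    elif values[-2:] == [9, 7]:
        values[-2:] = [10, 6]

    # Place of each letter: 2 = hundreds, 1 = tens, 0 = units.
    places = [2 if v >= 100 else 1 if v >= 10 else 0 for v in values]
    hundreds = [v for v in values if v >= 100]
    ordered = (places == sorted(places, reverse=True)
               and hundreds == sorted(hundreds, reverse=True))

    # Hundreds may repeat (e.g. 400 + 400), but at most one tens and one units letter.
    if not ordered or places.count(1) > 1 or places.count(0) > 1:
        return None
    return sum(values)


print(hebrew_to_int("יב"), hebrew_to_int("טו"), hebrew_to_int("שמות"))
```

Under these rules the book name שמות is rejected as a number because its letters run hundreds, tens, units, hundreds, and דברים is rejected because it ends in a final-form letter that has no value in the table.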

We focused this round of link finding on links with Bible, Mishnah, and Talmud targets.  We keep track of how many chapters are in each Biblical book and Tractate of Mishnah, and how many pages are in each Talmud volume. If a candidate number is out of range, we know it’s not a reference.
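A range check of this kind might look like the sketch below. The `LIMITS` table and `in_range` function are hypothetical and hold only a few sample entries. The sketch also assumes the standard printed pagination of the Talmud, in which every tractate begins on folio 2.

```python
# Hypothetical sample table: the last chapter of a chapter-based book,
# or the last folio (daf) of a Talmud tractate.
LIMITS = {
    "Genesis": ("chapters", 50),
    "Exodus": ("chapters", 40),
    "Proverbs": ("chapters", 31),
    "Shabbat": ("daf", 157),
}


def in_range(title, number):
    """Return True if number is a chapter or folio that exists in the named text."""
    if title not in LIMITS:
        return False
    kind, last = LIMITS[title]
    # Chapters count from 1; Talmud folios conventionally count from 2.
    first = 2 if kind == "daf" else 1
    return first <= number <= last


print(in_range("Proverbs", 10), in_range("Proverbs", 32), in_range("Shabbat", 1))
```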

If a candidate link passes all of these tests (it sits in parentheses, it starts with a known text name, and it is followed by potential numbers that are in range), we can be pretty sure that we’re looking at a reference, and we store it in our database.
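Put together, the tests can be sketched end to end as below. Everything here is an assumption for illustration rather than the script itself: the three-entry `TITLES` table, the `CANDIDATE` pattern (which treats round and square brackets as the delimiters), and a deliberately simplified `to_number` that only requires letter values to strictly descend. Only the chapter number is range-checked.

```python
import re

# Hypothetical sample: Hebrew title -> (English title, number of chapters).
TITLES = {
    "שמות": ("Exodus", 40),
    "דברים": ("Deuteronomy", 34),
    "משלי": ("Proverbs", 31),
}

VALUES = dict(zip(
    "אבגדהוזחטיכלמנסעפצקרשת",
    [*range(1, 10), *range(10, 100, 10), *range(100, 500, 100)],
))

# Anything enclosed in (...) or [...] with no nested brackets.
CANDIDATE = re.compile(r"[(\[]([^()\[\]]+)[)\]]")


def to_number(token):
    """Simplified check: known numeral letters whose values strictly descend."""
    values = [VALUES.get(c) for c in token if c not in "'\"\u05f3\u05f4"]
    if not values or None in values:
        return None
    if any(a <= b for a, b in zip(values, values[1:])):
        return None
    return sum(values)


def find_refs(text):
    """Return normalized references such as 'Exodus 20:12' found in brackets."""
    refs = []
    for match in CANDIDATE.finditer(text):
        words = match.group(1).split()
        # Expect a title followed by one or two numbers.
        if len(words) not in (2, 3) or words[0] not in TITLES:
            continue
        english, chapters = TITLES[words[0]]
        numbers = [to_number(w.strip(",:")) for w in words[1:]]
        if None in numbers or not 1 <= numbers[0] <= chapters:
            continue
        refs.append(english + " " + ":".join(str(n) for n in numbers))
    return refs


sample = "ככתוב (שמות כ, יב) ואלה שמות בני ישראל (דברים נ)"
print(find_refs(sample))
```

In the sample, the first parenthetical passes every test, the bare word שמות in running text is never considered because it is not bracketed, and the last candidate is dropped because chapter 50 is out of range for Deuteronomy.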

All in all, we added more than 18,000 links from Midrash, about 4,500 from the Talmud (we already had 12,000 there, thanks to a link-spotting script from William Herlands), 4,000 from commentaries on Bible and Talmud, 3,500 from Halachic works, and an even thousand from Tosefta. Another few hundred scattered around gives us the 32,000 that we added. From here on out, any new references to Bible, Mishnah, or Talmud that are added to our Hebrew library will be automatically linked.

We still have a whole bunch of links that we’re not yet parsing. Links to anything other than Bible, Mishnah, and Talmud aren’t covered yet. There are some patterns that we haven’t trained our code to see, and some links have forms that are difficult for a computer to spot. As an example of the kind of thing that we’re not yet doing: the commentators of the Middle Ages referred to the Torah, the Talmud, and each other’s work in ways that we no longer commonly use. They don’t use our pagination; the pagination of the Talmud is a relatively late phenomenon. These commentators refer to a Talmud section by chapter name, and they often do it inline with the rest of their text, without giving us the clue of parentheses. We’re not catching them yet, but we’re aiming to get them soon.

There are lots of links that we’re still not catching! Tracking down sources and putting links into Sefaria is a great way to contribute. If you’d like to help out in this way, drop us a line. We can give you an idea of where the best link-hunting territories are.
