Repository: woahdae/simple_xlsx_reader Branch: master Commit: 22d783c88a75 Files: 38 Total size: 92.0 KB Directory structure: gitextract_v1o319sx/ ├── .github/ │ ├── dependabot.yml │ └── workflows/ │ └── ruby.yml ├── .gitignore ├── .travis.yml ├── CHANGELOG.md ├── Gemfile ├── LICENSE.txt ├── README.md ├── Rakefile ├── lib/ │ ├── simple_xlsx_reader/ │ │ ├── document.rb │ │ ├── hyperlink.rb │ │ ├── loader/ │ │ │ ├── shared_strings_parser.rb │ │ │ ├── sheet_parser.rb │ │ │ ├── style_types_parser.rb │ │ │ └── workbook_parser.rb │ │ ├── loader.rb │ │ └── version.rb │ └── simple_xlsx_reader.rb ├── simple_xlsx_reader.gemspec └── test/ ├── chunky_utf8.xlsx ├── date1904.xlsx ├── date1904_test.rb ├── datetime_test.rb ├── datetimes.xlsx ├── gdocs_sheet.xlsx ├── gdocs_sheet_test.rb ├── lower_case_sharedstrings.xlsx ├── lower_case_sharedstrings_test.rb ├── misc_numbers.xlsx ├── namespaces_and_missing_atts_test.rb ├── percentages_n_currencies.xlsx ├── performance_test.rb ├── sesame_street_blog.xlsx ├── shared_strings.xml ├── simple_xlsx_reader_test.rb ├── styles.xml ├── test_helper.rb └── test_xlsx_builder.rb ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/dependabot.yml ================================================ version: 2 updates: - package-ecosystem: "github-actions" directory: "/" schedule: interval: "weekly" ================================================ FILE: .github/workflows/ruby.yml ================================================ # This workflow uses actions that are not certified by GitHub. # They are provided by a third-party and are governed by # separate terms of service, privacy policy, and support # documentation. 
# This workflow will download a prebuilt Ruby version, install dependencies and run tests with Rake
# For more information see: https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby

name: Ruby

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

permissions:
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        ruby-version: ['2.6', '2.7', '3.0', '3.1', '3.2', '3.3']
    steps:
    - uses: actions/checkout@v4
    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: ${{ matrix.ruby-version }}
        bundler-cache: true # runs 'bundle install' and caches installed gems automatically
    - name: Run tests
      run: bundle exec rake

================================================
FILE: .gitignore
================================================
*.gem
*.rbc
.bundle
.config
.yardoc
Gemfile.lock
InstalledFiles
_yardoc
coverage
doc/
lib/bundler/man
pkg
rdoc
spec/reports
test/tmp
test/version_tmp
tmp

================================================
FILE: .travis.yml
================================================
language: ruby
cache: bundler
before_install:
  - gem update bundler
rvm:
  - 2.5.8
  - 2.7.2
  - 3.0.0

================================================
FILE: CHANGELOG.md
================================================
### 5.1.0

* Parse sheets containing namespaces and no 'r' att (@skipchris)
* Fix Zlib error when loading from string (@myabc)
* Prevent a SimpleXlsxReader::CellLoadError (no implicit conversion of Integer
  into String) when the casted value (friendly name) is not a string (@tsdbrown)
* Accidental 25% performance improvement while experimenting with namespace
  support (see #53f5a9).

### 5.0.0

* Change SimpleXlsxReader::Hyperlink to default to the visible cell value
  instead of the hyperlink URL, which in the case of mailto hyperlinks is
  surprising.
* Fix blank content when parsing docs from string (@codemole)

### 4.0.1

* Fix nil error when handling some inline strings

Inline strings are almost exclusively used by non-Excel XLSX
implementations, but are valid, and sometimes have nil chunks.

Also, inline strings weren't preserving whitespace if Nokogiri is
parsing the string in chunks, as it does when encountering escaped
characters. Fixed.

### 4.0.0

* Fix percentage rounding errors. Previously we were dividing by 100, when we
  actually don't need to, so percentage types were 100x too small. Fixes #21.
  Major bump because workarounds might have been implemented for previous
  incorrect behavior.
* Fix small oddity in one currency format where round numbers would be cast
  to an integer instead of a float.

### 3.0.1

* Fix parsing "chunky" UTF-8 workbooks. Closes issues #39 and #45. See ce67f0d4.

### 3.0.0

* Change the way we typecast cells in the General format. This probably won't
  break anything in your app, but it's a change in behavior that theoretically
  could.

  Previously, we were treating cells using the General format as strings, when
  according to the Office XML standard, they should be treated as numbers. We
  now attempt to cast such cells as numbers, and fall back to strings if number
  casting fails.

  Thanks @jrodrigosm

### 2.0.1

* Restore ability to parse IO strings (@robbevp)
* Add Ruby 3.1 and 3.2 to CI (@taichi-ishitani)

### 2.0.0

* SPEED
  * Reimplement internals in terms of a SAX parser
  * Change `SimpleXlsxReader::Sheet#rows` to be a `RowsProxy` that streams `#each`
* Convenience - use `rows#each(headers: true)` to get header names while enumerating rows

### 1.0.5

* Support string or io input via `SimpleXlsxReader#parse` (@kalsan, @til)

### 1.0.4

* Fix Windows + RubyZip 1.2.1 bug preventing files from being read
* Add ability to parse hyperlinks
* Support files exported from Google Docs (@Strnadj)

### 1.0.3

Broken on Ruby 1.9; yanked.
### 1.0.2

* Fix Ruby 1.9.3-specific bug preventing parsing most sheets [middagj, eritiro]
* Better support for non-excel-generated xlsx files [bwlang]
  * You don't always have a numFmtId column, and that's OK
  * Sometimes 'sharedStrings.xml' can be 'sharedstrings.xml'
* Fixed parsing times very close to 12/30/1899 [Valeriy Utyaganov]
* Be more flexible with custom formats using a numFmtId < 164

### 1.0.1

* Add support for the 1904 date system [zilverline]

### 1.0.0

No changes since 1.0.0.pre. Releasing 1.0.0 since the project has seen a few
months of stability in terms of bug fix requests, and the API is not going to
change.

### 1.0.0.pre

* Handle files with blank rows [Brian Hoffman]
* Preserve seconds when casting datetimes [Rob Newbould]
* Preserve empty rows (previously would be omitted)
* Speed up parsing by ~55%

### 0.9.8

* Rubyzip 1.0 compatibility

### 0.9.7

* Fix cell parsing where cells have a type, but no content
* Add a speed test; parsing performs in linear time, but a relatively slow line :/

### 0.9.6

* Fix worksheet indexes when worksheets have been deleted

### 0.9.5

* Fix inlineStr support (broken by formula support commit)

### 0.9.4

* Formula support. Formulas used to cause things to blow up, now they don't!
* Support number types styled as dates. Previously, the type was honored
  above the style, which is incorrect for dates; date-numbers now parse as
  dates.
* Error-free parsing of empty sheets
* Fix custom styles w/ numFmtId == 164. Custom style types are delineated
  starting *at* numFmtId 164, not greater than 164.

### 0.9.3

* Support 1.8.7 (tests pass). Ongoing support will depend on ease.

### 0.9.2

* Support reading files written by ex. simple_xlsx_writer that don't
  specify sheet dimensions explicitly (which Excel does).

### 0.9.1

* Fixed an important parse bug that ignored empty 'Generic' cells

### 0.9.0

* Initial release. 0.9 version number is meant to reflect the near-stable
  public api, yet still prerelease status of the project.
================================================ FILE: Gemfile ================================================ source 'https://rubygems.org' # Specify your gem's dependencies in simple_xlsx_reader.gemspec gemspec ================================================ FILE: LICENSE.txt ================================================ Copyright (c) 2013 Woody Peterson MIT License Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # SimpleXlsxReader A [fast](#performance) xlsx reader for Ruby that parses xlsx cell values into plain ruby primitives and dates/times. This is *not* a rewrite of excel in Ruby. Font styles, for example, are parsed to determine whether a cell is a number or a date, then forgotten. We just want to get the data, and get out! ## Summary (now with stream parsing): ```ruby doc = SimpleXlsxReader.open('/path/to/workbook.xlsx') doc.sheets # => [<#SXR::Sheet>, ...] 
doc.sheets.first.name # 'Sheet1'
rows = doc.sheets.first.rows # a RowsProxy
rows.each # an Enumerator ready to chain or stream
rows.each {} # Streams the rows to your block
rows.each(headers: true) {} # Streams row-hashes
rows.each(headers: {id: /ID/}) {} # finds & maps headers, streams
rows.slurp # Slurps rows into memory as a 2D array
```

That's the gist of it!

See also the [Document](https://github.com/woahdae/simple_xlsx_reader/blob/2.0.0-pre/lib/simple_xlsx_reader/document.rb) object.

## Why?

### Accurate

This project was started years ago, primarily because other Ruby xlsx parsers
didn't import data with the correct types. Numbers as strings, dates as numbers,
[hyperlinks](https://github.com/woahdae/simple_xlsx_reader/blob/master/lib/simple_xlsx_reader/hyperlink.rb)
with inaccessible URLs, or - subtly buggy - simple dates as DateTime objects. If
your app uses a timezone offset, depending on what timezone and what time of day
you load the xlsx file, your dates might end up a day off! SimpleXlsxReader
understands all these correctly.

### Idiomatic

Many Ruby xlsx parsers seem to be inspired more by Excel than Ruby, frankly.
SimpleXlsxReader strives to be fairly idiomatic Ruby:

```ruby
# quick example having fun w/ ruby
doc = SimpleXlsxReader.open(file_path) # or SimpleXlsxReader.parse(string_or_io)
doc.sheets.first.rows.each(headers: {id: /ID/})
  .with_index.with_object({}) do |(row, index), acc|
  acc[row[:id]] = index
end
```

### Now faster

Finally, as of v2.0, SimpleXlsxReader is the fastest and most memory-efficient
parser. Previously this project couldn't reasonably load anything over ~10k
rows. Other parsers could load 100k+ rows, but were still taking ~1gb RSS to do
so, even "streaming," which seemed excessive. So a SAX implementation was born.
See [performance](#performance) for details.
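
To make the date pitfall from the "Accurate" section concrete, here is a small sketch using only Ruby's standard `date` library. It does not use SimpleXlsxReader at all; the "misread as midnight UTC" step is an assumed behavior of some other, hypothetical parser, and the specific date and offset are made up for illustration.

```ruby
require 'date'

# A date-only cell kept as a plain Date: no time of day, no zone.
as_date = Date.new(2024, 3, 10)

# The same cell misread as midnight UTC (assumed behavior of a
# hypothetical parser, not of this gem).
as_datetime = DateTime.new(2024, 3, 10, 0, 0, 0, '+00:00')

# Viewing that instant from a zone eight hours behind UTC moves it to the
# previous evening, so the calendar date changes.
shifted = as_datetime.new_offset('-08:00')

puts as_date         # 2024-03-10
puts shifted.to_date # 2024-03-09
```

A plain `Date` has no offset to convert, which is why keeping date-only cells as `Date` avoids the off-by-one day.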
## Usage

### Streaming

SimpleXlsxReader is performant by default - if you use
`rows.each {|row| ...}` it will stream the XLSX rows to your block without
loading either the sheet XML or the full sheet data into memory.

You can also chain `rows.each` with other Enumerable functions without
triggering a slurp, and you have lots of ways to find and map headers while
streaming.

If you had an excel sheet representing this data:

```
| Hero ID | Hero Name  | Location     |
| 13576   | Samus Aran | Planet Zebes |
| 117     | John Halo  | Ring World   |
| 9704133 | Iron Man   | Planet Earth |
```

Get a handle on the rows proxy:

```ruby
rows = SimpleXlsxReader.open('suited_heroes.xlsx').sheets.first.rows
```

Simple streaming (kinda boring):

```ruby
rows.each { |row| ... }
```

Streaming with headers, and how about a little enumerable chaining:

```ruby
# Map of hero names by ID: { 117 => 'John Halo', ... }

rows.each(headers: true).with_object({}) do |row, acc|
  acc[row['Hero ID']] = row['Hero Name']
end
```

Sometimes though you have some junk at the top of your spreadsheet:

```
| Unofficial Report  |                        |              |
| Dont tell Nintendo | Yes "John Halo" I know |              |
|                    |                        |              |
| Hero ID            | Hero Name              | Location     |
| 13576              | Samus Aran             | Planet Zebes |
| 117                | John Halo              | Ring World   |
| 9704133            | Iron Man               | Planet Earth |
```

For this, `headers` can be a hash whose keys replace headers and whose values
help find the correct header row:

```ruby
# Same map of hero names by ID: { 117 => 'John Halo', ... }

rows.each(headers: {id: /ID/, name: /Name/}).with_object({}) do |row, acc|
  acc[row[:id]] = row[:name]
end
```

If your header-to-attribute mapping is more complicated than key/value, you
can do the mapping elsewhere, but use a block to find the header row:

```ruby
# Example roughly analogous to some production code mapping a single spreadsheet
# across many objects. Might be a simpler way now that we have the headers-hash
# feature.

object_map = { Hero => { id: 'Hero ID', name: 'Hero Name', location: 'Location' } }

HEADERS = ['Hero ID', 'Hero Name', 'Location']

rows.each(headers: ->(row) { (HEADERS & row).any? }) do |row|
  object_map.each_pair do |klass, attribute_map|
    attributes =
      attribute_map.each_pair.with_object({}) do |(key, header), attrs|
        attrs[key] = row[header]
      end

    klass.new(attributes)
  end
end
```

### Slurping

To make SimpleXlsxReader rows act like an array, for use with legacy
SimpleXlsxReader apps or otherwise, we still support slurping the whole array
into memory. The good news is even when doing this, the xlsx worksheet & shared
string files are never loaded as a (big) Nokogiri doc, so that's nice.

By default, to prevent accidental slurping, `rows` will throw an exception if
you try to access it with array methods like `[]` and `shift` without explicitly
slurping first. You can slurp either by calling `rows.slurp` or globally by
setting `SimpleXlsxReader.configuration.auto_slurp = true`.

Once slurped, enumerable methods on `rows` will use the slurped data
(i.e. not re-parse the sheet), and those Array-like methods will work.

We don't support all Array methods, just the few we have used in real projects,
as we transition towards streaming instead.

### Load Errors

By default, cell load errors (ex. if a date cell contains the string 'hello')
result in a SimpleXlsxReader::CellLoadError.

If you would like to provide better error feedback to your users, you
can set `SimpleXlsxReader.configuration.catch_cell_load_errors = true`,
and load errors will instead be inserted into Sheet#load_errors keyed
by [rownum, colnum]:

```ruby
{
  [rownum, colnum] => '[error]'
}
```

### Performance

SimpleXlsxReader is (as of this writing) the fastest and most memory efficient
Ruby xlsx parser.

Recent updates here have focused on large spreadsheets with especially
non-unique strings in sheets using xlsx's shared strings feature
(Excel-generated spreadsheets always use this).
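
For readers unfamiliar with shared strings, here is a rough, self-contained sketch of the idea: each unique string is stored once in a table, and string cells carry only an index into it. This is not the gem's code. The variable names and the `[type, raw]` cell representation are hypothetical, chosen only for illustration; the one detail taken from the xlsx format is that a cell type of `s` marks a shared-string index.

```ruby
# Hypothetical shared strings table: each unique string is stored once.
shared_strings = ['Samus Aran', 'Planet Zebes', 'John Halo', 'Ring World']

# Hypothetical raw cells as [type, raw value] pairs. A type of 's' means the
# raw value is an index into the shared strings table; 'n' is used here as a
# plain integer for simplicity.
raw_rows = [
  [['n', '13576'], ['s', '0'], ['s', '1']],
  [['n', '117'],   ['s', '2'], ['s', '3']]
]

# Resolve every cell: look up shared strings by index, cast the rest.
resolved = raw_rows.map do |row|
  row.map do |type, raw|
    type == 's' ? shared_strings[Integer(raw)] : Integer(raw)
  end
end

resolved.each { |row| p row }
```

Since every string cell depends on this table, a parser needs it available before it can emit rows, which is why the cost of loading it matters for large files.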

Other projects have implemented streaming parsers for the sheet data, but
currently none stream while loading the shared strings file, which is the
second-largest file in an xlsx archive and can represent millions of strings
in large files.

For more details, see [my fork of @shkm's excel benchmark project](https://github.com/woahdae/excel-parsing-benchmarks),
but here's the summary:

1mb excel file, 10,000 rows of sample "sales records" with a fair amount of
non-unique strings (ran on an M1 Macbook Pro):

| Gem                | Parses/second | RSS Increase | Allocated Mem | Retained Mem | Allocated Objects | Retained Objects |
|--------------------|---------------|--------------|---------------|--------------|-------------------|------------------|
| simple_xlsx_reader | 1.13          | 36.94mb      | 614.51mb      | 1.13kb       | 8796275           | 3                |
| roo                | 0.75          | 74.0mb       | 164.47mb      | 2.18kb       | 2128396           | 4                |
| creek              | 0.65          | 107.55mb     | 581.38mb      | 3.3kb        | 7240760           | 16               |
| xsv                | 0.61          | 75.66mb      | 2127.42mb     | 3.66kb       | 5922563           | 10               |
| rubyxl             | 0.27          | 373.52mb     | 716.7mb       | 2.18kb       | 10612577          | 4                |

Here is a benchmark for the "worst" file I've seen, a 26mb file whose shared
strings represent 10% of the archive (note, MemoryProfiler has too much
overhead to reasonably measure allocations so that analysis was left off, and
we just measure total time for one parse):

| Gem                | Time    | RSS Increase |
|--------------------|---------|--------------|
| simple_xlsx_reader | 28.71s  | 148.77mb     |
| roo                | 40.25s  | 1322.08mb    |
| xsv                | 45.82s  | 391.27mb     |
| creek              | 60.63s  | 886.81mb     |
| rubyxl             | 238.68s | 9136.3mb     |

## Installation

Add this line to your application's Gemfile:

    gem 'simple_xlsx_reader'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install simple_xlsx_reader

## Versioning

This project follows [semantic versioning 1.0](http://semver.org/spec/v1.0.0.html)

## Contributing

Remember to write tests, think about edge cases, and run the existing suite.
The full suite contains a performance test that on an M1 MBP runs the final large file in about five seconds. Check out that test before & after your change to check for performance changes. Then, the standard stuff: 1. Fork this project 2. Create your feature branch (`git checkout -b my-new-feature`) 3. Commit your changes (`git commit -am 'Add some feature'`) 4. Push to the branch (`git push origin my-new-feature`) 5. Create new Pull Request ================================================ FILE: Rakefile ================================================ # frozen_string_literal: true require "bundler/gem_tasks" require 'rake/testtask' Rake::TestTask.new do |t| t.pattern = "test/**/*_test.rb" t.libs << 'test' end task default: [:test] ================================================ FILE: lib/simple_xlsx_reader/document.rb ================================================ # frozen_string_literal: true require 'forwardable' module SimpleXlsxReader ## # Main class for the public API. See the README for usage examples, # or read the code, it's pretty friendly. class Document attr_reader :string_or_io def initialize(legacy_file_path = nil, file_path: nil, string_or_io: nil) fail(ArgumentError, 'either file_path or string_or_io must be provided') if legacy_file_path.nil? && file_path.nil? && string_or_io.nil? 
@string_or_io = string_or_io || File.new(legacy_file_path || file_path) end def sheets @sheets ||= Loader.new(string_or_io).init_sheets end # Expensive because it slurps all the sheets into memory, # probably only appropriate for testing def to_hash sheets.each_with_object({}) { |sheet, acc| acc[sheet.name] = sheet.rows.to_a; } end # `rows` is a RowsProxy that responds to #each class Sheet extend Forwardable attr_reader :name, :rows def_delegators :rows, :load_errors, :slurp def initialize(name:, sheet_parser:) @name = name @rows = RowsProxy.new(sheet_parser: sheet_parser) end # Legacy - consider `rows.each(headers: true)` for better performance def headers rows.slurped![0] end # Legacy - consider `rows` or `rows.each(headers: true)` for better # performance def data rows.slurped![1..-1] end end # Waits until we call #each with a block to parse the rows class RowsProxy include Enumerable attr_reader :slurped, :load_errors def initialize(sheet_parser:) @sheet_parser = sheet_parser @slurped = nil @load_errors = {} end # By default, #each streams the rows to the provided block, either as # arrays, or as header => cell value pairs if provided a `headers:` # argument. # # `headers` can be: # # * `true` - simply takes the first row as the header row # * block - calls the block with successive rows until the block returns # true, which it then uses that row for the headers. All data prior to # finding the headers is ignored. # * hash - transforms the header row by replacing cells with keys matched # by value, ex. `{id: /ID|Identity/, name: /Name/i, date: 'Date'}` would # potentially yield the row `{id: 5, name: 'Jane', date: [Date object]}` # instead of the headers from the sheet. It would also search for the # row that matches at least one header, in case the header row isn't the # first. # # If rows have been slurped, #each will iterate the slurped rows instead. 
# # Note, calls to this after slurping will raise if given the `headers:` # argument, as that's handled by the sheet parser. If this is important # to someone, speak up and we could potentially support it. def each(headers: false, &block) if slurped? raise '#each does not support headers with slurped rows' if headers slurped.each(&block) elsif block_given? # It's possible to slurp while yielding to the block, which would # null out @sheet_parser, so let's just keep track of it here too sheet_parser = @sheet_parser @sheet_parser.parse(headers: headers, &block).tap do @load_errors = sheet_parser.load_errors end else to_enum(:each, headers: headers) end end # Mostly for legacy support, I'm not aware of a use case for doing this # when you don't have to. # # Note that #each will use slurped results if available, and since we're # leveraging Enumerable, all the other Enumerable methods will too. def slurp # possibly release sheet parser from memory on next GC run; # untested, but it can hold a lot of stuff, so worth a try @slurped ||= to_a.tap { @sheet_parser = nil } end def slurped? !!@slurped end def slurped! check_slurped slurped end def [](*args) check_slurped slurped[*args] end def shift(*args) check_slurped slurped.shift(*args) end private def check_slurped slurp if SimpleXlsxReader.configuration.auto_slurp return if slurped? raise 'Called a slurp-y method without explicitly slurping;'\ ' use #each or call rows.slurp first' end end end end ================================================ FILE: lib/simple_xlsx_reader/hyperlink.rb ================================================ # frozen_string_literal: true module SimpleXlsxReader # We support hyperlinks as a "type" even though they're technically # represented either as a function or an external reference in the xlsx spec. # # In practice, hyperlinks are usually a link or a mailto. 
In the case of a # link, we probably want to follow it to download something, but in the case # of an email, we probably just want the email and not the mailto. So we # represent a hyperlink primarily as it is seen by the user, following the # principle of least surprise, but the url is accessible via #url. # # Microsoft calls the visible part of a hyperlink cell the "friendly name," # so we expose that as a method too, in case you want to be explicit about # how you're accessing it. # # See MS documentation on the HYPERLINK function for some background: # https://support.office.com/en-us/article/HYPERLINK-function-333c7ce6-c5ae-4164-9c47-7de9b76f577f class Hyperlink < String attr_reader :friendly_name attr_reader :url def initialize(url, friendly_name = nil) @url = url @friendly_name = friendly_name&.to_s super(@friendly_name || @url) end end end ================================================ FILE: lib/simple_xlsx_reader/loader/shared_strings_parser.rb ================================================ # frozen_string_literal: true module SimpleXlsxReader class Loader # For performance reasons, excel uses an optional SpreadsheetML feature # that puts all strings in a separate xml file, and then references # them by their index in that file. 
    #
    # http://msdn.microsoft.com/en-us/library/office/gg278314.aspx
    class SharedStringsParser < Nokogiri::XML::SAX::Document
      def self.parse(file)
        new.tap do |parser|
          Nokogiri::XML::SAX::Parser.new(parser).parse(file)
        end.result
      end

      def initialize
        @result = []
        @composite = false
        @extract = false
      end

      attr_reader :result

      def start_element(name, _attrs = [])
        case name
        when 'si' then @current_string = +"" # UTF-8 variant of String.new
        when 't' then @extract = true
        end
      end

      def characters(string)
        return unless @extract

        @current_string << string
      end

      def end_element(name)
        case name
        when 't' then @extract = false
        when 'si' then @result << @current_string
        end
      end
    end
  end
end

================================================
FILE: lib/simple_xlsx_reader/loader/sheet_parser.rb
================================================
# frozen_string_literal: true

require 'forwardable'

module SimpleXlsxReader
  class Loader
    class SheetParser < Nokogiri::XML::SAX::Document
      extend Forwardable

      attr_accessor :xrels_file
      attr_accessor :hyperlinks_by_cell
      attr_reader :load_errors

      def_delegators :@loader, :style_types, :shared_strings, :base_date

      def initialize(file_io:, loader:)
        @file_io = file_io
        @loader = loader
      end

      def parse(headers: false, &block)
        raise 'parse called without a block; what should this do?'\
          unless block_given?

        @headers = headers
        @each_callback = block
        @load_errors = {}
        @current_row_num = nil
        @last_seen_row_idx = 0
        @url = nil # silence warnings
        @function = nil # silence warnings
        @capture = nil # silence warnings
        @captured = nil # silence warnings
        @dimension = nil # silence warnings
        @column_index = 0

        @file_io.rewind # if it's IO from IO.read, we need to rewind it

        # In this project this is only used for GUI-made hyperlinks (as opposed
        # to FUNCTION-based hyperlinks). Unfortunately they're needed to parse
        # the spreadsheet, and they come AFTER the sheet data.
So, solution is # to just stream-parse the file twice, first for the hyperlinks at the # bottom of the file, then for the file itself. In the future it would # be clever to use grep to extract the xml into its own smaller file. if xrels_file if xrels_file.grep(/hyperlink/).any? xrels_file.rewind load_gui_hyperlinks # represented as hyperlinks_by_cell end @file_io.rewind # we've already parsed this once end Nokogiri::XML::SAX::Parser.new(self).parse(@file_io) end ### # SAX document hooks def start_element_namespace(name, attrs = [], _prefix, _uri, _ns) case name when 'dimension' @dimension = attrs.last.value when 'row' @current_row_num = attrs.find {|attr| attr.localname == 'r'}&.value&.to_i @current_row = Array.new(column_length) @column_index = 0 when 'c' attrs = attrs.inject({}) {|acc, attr| acc[attr.localname] = attr.value; acc} @cell_name = attrs['r'] || column_number_to_letter(@column_index) @type = attrs['t'] @style = attrs['s'] && style_types[attrs['s'].to_i] @column_index += 1 when 'f' then @function = true when 'v', 't' then @capture = true end end def characters(string) if @function # the only "function" we support is a hyperlink @url = string.slice(/HYPERLINK\("(.*?)"/, 1) end return unless @capture captured = begin SimpleXlsxReader::Loader.cast( string, @type, @style, url: @url || hyperlinks_by_cell&.[](@cell_name), shared_strings: shared_strings, base_date: base_date ) rescue StandardError => e column, row = @cell_name.match(/([A-Z]+)([0-9]+)/).captures col_idx = column_letter_to_number(column) - 1 row_idx = row.to_i - 1 if !SimpleXlsxReader.configuration.catch_cell_load_errors error = CellLoadError.new( "Row #{row_idx}, Col #{col_idx}: #{e.message}" ) error.set_backtrace(e.backtrace) raise error else @load_errors[[row_idx, col_idx]] = e.message string end end # For some reason I can't figure out in a reasonable timeframe, # SAX parsing some workbooks captures separate strings in the same cell # when we encounter UTF-8, although I can't get workbooks 
made in my # own version of excel to repro it. Our fix is just to keep building # the string in this case, although maybe there's a setting in Nokogiri # to make it not do this (looked, couldn't find it). # # Loading the workbook test/chunky_utf8.xlsx repros the issue. @captured = @captured ? @captured + (captured || '') : captured end def end_element_namespace(name, _prefix, _uri) case name when 'row' if @headers == true # ya a little funky @headers = @current_row elsif @headers.is_a?(Hash) test_headers_hash_against_current_row # in case there were empty rows before finding the header @last_seen_row_idx = @current_row_num - 1 elsif @headers.respond_to?(:call) @headers = @current_row if @headers.call(@current_row) # in case there were empty rows before finding the header @last_seen_row_idx = @current_row_num - 1 elsif @headers possibly_yield_empty_rows(headers: true) yield_row(@current_row, headers: true) else possibly_yield_empty_rows(headers: false) yield_row(@current_row, headers: false) end @last_seen_row_idx += 1 # Note that excel writes a '/worksheet/dimension' node we can get # this from, but some libs (ex. simple_xlsx_writer) don't record it. # In that case, we assume the data is of uniform column length and # store the column name of the last header row we see. Obviously this # isn't the most robust strategy, but it likely fits 99% of use cases # considering it's not a problem with actual excel docs. @dimension = "A1:#{@cell_name}" if @dimension.nil? when 'v', 't' @current_row[cell_idx] = @captured @capture = false @captured = nil when 'f' then @function = false when 'c' then @url = nil end end ### # /End SAX hooks def test_headers_hash_against_current_row found = false @current_row.each_with_index do |cell, cell_idx| @headers.each_pair do |key, search| if search.is_a?(String) ? 
cell == search : cell&.match?(search) found = true @current_row[cell_idx] = key end end end @headers = @current_row if found end def possibly_yield_empty_rows(headers:) while @current_row_num && @current_row_num > @last_seen_row_idx + 1 @last_seen_row_idx += 1 yield_row(Array.new(column_length), headers: headers) end end def yield_row(row, headers:) if headers @each_callback.call(Hash[@headers.zip(row)]) else @each_callback.call(row) end end # This sax-parses the whole sheet, just to extract hyperlink refs at the end. def load_gui_hyperlinks self.hyperlinks_by_cell = HyperlinksParser.parse(@file_io, xrels: xrels) end class HyperlinksParser < Nokogiri::XML::SAX::Document def initialize(file_io, xrels:) @file_io = file_io @xrels = xrels end def self.parse(file_io, xrels:) new(file_io, xrels: xrels).parse end def parse @hyperlinks_by_cell = {} Nokogiri::XML::SAX::Parser.new(self).parse(@file_io) @hyperlinks_by_cell end def start_element_namespace(name, attrs, _prefix, _uri, _ns) case name when 'hyperlink' attrs = attrs.inject({}) {|acc, attr| acc[attr.localname] = attr.value; acc} id = attrs['id'] || attrs['r:id'] @hyperlinks_by_cell[attrs['ref']] = @xrels.at_xpath(%(//*[@Id="#{id}"])).attr('Target') end end end def xrels @xrels ||= Nokogiri::XML(xrels_file.read) if xrels_file end def column_length return 0 unless @dimension @column_length ||= column_letter_to_number(last_cell_letter) end def cell_idx column_letter_to_number(@cell_name.scan(/[A-Z]+/).first) - 1 end ## # Returns the last column name, ex. 
'E' def last_cell_letter return unless @dimension @dimension.scan(/:([A-Z]+)/)&.first&.first || 'A' end # formula fits an exponential factorial function of the form: # 'A' = 1 # 'B' = 2 # 'Z' = 26 # 'AA' = 26 * 1 + 1 # 'AZ' = 26 * 1 + 26 # 'BA' = 26 * 2 + 1 # 'ZA' = 26 * 26 + 1 # 'ZZ' = 26 * 26 + 26 # 'AAA' = 26 * 26 * 1 + 26 * 1 + 1 # 'AAZ' = 26 * 26 * 1 + 26 * 1 + 26 # 'ABA' = 26 * 26 * 1 + 26 * 2 + 1 # 'BZA' = 26 * 26 * 2 + 26 * 26 + 1 def column_letter_to_number(column_letter) pow = column_letter.length - 1 result = 0 column_letter.each_byte do |b| result += 26**pow * (b - 64) pow -= 1 end result end def column_number_to_letter(n) result = [] loop do result.unshift((n % 26 + 65).chr) n = (n / 26) - 1 break if n < 0 end result.join end end end end ================================================ FILE: lib/simple_xlsx_reader/loader/style_types_parser.rb ================================================ # frozen_string_literal: true module SimpleXlsxReader class Loader StyleTypesParser = Struct.new(:file_io) do def self.parse(file_io) new(file_io).tap(&:parse).style_types end # Map of non-custom numFmtId to casting symbol NumFmtMap = { 0 => :string, # General 1 => :fixnum, # 0 2 => :float, # 0.00 3 => :fixnum, # #,##0 4 => :float, # #,##0.00 5 => :unsupported, # $#,##0_);($#,##0) 6 => :unsupported, # $#,##0_);[Red]($#,##0) 7 => :unsupported, # $#,##0.00_);($#,##0.00) 8 => :unsupported, # $#,##0.00_);[Red]($#,##0.00) 9 => :percentage, # 0% 10 => :percentage, # 0.00% 11 => :bignum, # 0.00E+00 12 => :unsupported, # # ?/? 13 => :unsupported, # # ??/?? 
14 => :date, # mm-dd-yy 15 => :date, # d-mmm-yy 16 => :date, # d-mmm 17 => :date, # mmm-yy 18 => :time, # h:mm AM/PM 19 => :time, # h:mm:ss AM/PM 20 => :time, # h:mm 21 => :time, # h:mm:ss 22 => :date_time, # m/d/yy h:mm 37 => :unsupported, # #,##0 ;(#,##0) 38 => :unsupported, # #,##0 ;[Red](#,##0) 39 => :unsupported, # #,##0.00;(#,##0.00) 40 => :unsupported, # #,##0.00;[Red](#,##0.00) 44 => :float, # some odd currency format ?from Office 2007? 45 => :time, # mm:ss 46 => :time, # [h]:mm:ss 47 => :time, # mmss.0 48 => :bignum, # ##0.0E+0 49 => :unsupported # @ }.freeze def parse @xml = Nokogiri::XML(file_io.read).remove_namespaces! end # Excel doesn't record types for some cells, only its display style, so # we have to back out the type from that style. # # Some of these styles can be determined from a known set (see NumFmtMap), # while others are 'custom' and we have to make a best guess. # # This is the array of types corresponding to the styles a spreadsheet # uses, and includes both the known style types and the custom styles. # # Note that the xml sheet cells that use this don't reference the # numFmtId, but instead the array index of a style in the stored list of # only the styles used in the spreadsheet (which can be either known or # custom). Hence this style types array, rather than a map of numFmtId to # type. def style_types @xml.xpath('/styleSheet/cellXfs/xf').map do |xstyle| style_type_by_num_fmt_id( xstyle.attributes['numFmtId']&.value ) end end # Finds the type we think a style is; For example, fmtId 14 is a date # style, so this would return :date. # # Note, custom styles usually (are supposed to?) have a numFmtId >= 164, # but in practice can sometimes be simply out of the usual "Any Language" # id range that goes up to 49. For example, I have seen a numFmtId of # 59 specified as a date. In Thai, 59 is a number format, so this seems # like a bad idea, but we try to be flexible and just go with it. 
def style_type_by_num_fmt_id(id) return nil if id.nil? id = id.to_i NumFmtMap[id] || custom_style_types[id] end # Map of (numFmtId >= 164) (custom styles) to our best guess at the type # ex. {164 => :date_time} def custom_style_types @custom_style_types ||= @xml.xpath('/styleSheet/numFmts/numFmt') .each_with_object({}) do |xstyle, acc| acc[xstyle.attributes['numFmtId'].value.to_i] = determine_custom_style_type(xstyle.attributes['formatCode'].value) end end # This is the least deterministic part of reading xlsx files. Due to # custom styles, you can't know for sure when a date is a date other than # looking at its format and guessing. It's not impossible to guess right, # though. # # http://stackoverflow.com/questions/4948998/determining-if-an-xlsx-cell-is-date-formatted-for-excel-2007-spreadsheets def determine_custom_style_type(string) return :float if string[0] == '_' return :float if string[0] == ' 0' # Looks for one of ymdhis outside of meta-stuff like [Red] return :date_time if string =~ /(^|\])[^\[]*[ymdhis]/i :unsupported end end end end ================================================ FILE: lib/simple_xlsx_reader/loader/workbook_parser.rb ================================================ # frozen_string_literal: true module SimpleXlsxReader class Loader WorkbookParser = Struct.new(:file_io) do def self.parse(file_io) parser = new(file_io).tap(&:parse) [parser.sheet_toc, parser.base_date] end def parse @xml = Nokogiri::XML(file_io.read).remove_namespaces! end # Table of contents for the sheets, ex. {'Authors' => 0, ...} def sheet_toc @xml.xpath('/workbook/sheets/sheet') .each_with_object({}) do |sheet, acc| acc[sheet.attributes['name'].value] = sheet.attributes['sheetId'].value.to_i - 1 # keep things 0-indexed end end ## Returns the base_date from which to calculate dates. # Defaults to 1900 (minus two days due to excel quirk), but use 1904 if # it's set in the Workbook's workbookPr.
# http://msdn.microsoft.com/en-us/library/ff530155(v=office.12).aspx def base_date return DATE_SYSTEM_1900 if @xml.nil? @xml.xpath('//workbook/workbookPr[@date1904]').each do |workbookPr| return DATE_SYSTEM_1904 if workbookPr['date1904'] =~ /true|1/i end DATE_SYSTEM_1900 end end end end ================================================ FILE: lib/simple_xlsx_reader/loader.rb ================================================ # frozen_string_literal: true module SimpleXlsxReader class Loader < Struct.new(:string_or_io) attr_accessor :shared_strings, :sheet_parsers, :sheet_toc, :style_types, :base_date def init_sheets ZipReader.new( string_or_io: string_or_io, loader: self ).read sheet_toc.each_with_index.map do |(sheet_name, _sheet_number), i| # sheet_number is *not* the index into xml.sheet_parsers SimpleXlsxReader::Document::Sheet.new( name: sheet_name, sheet_parser: sheet_parsers[i] ) end end ZipReader = Struct.new(:string_or_io, :loader, keyword_init: true) do attr_reader :zip def initialize(*args) super @zip = SimpleXlsxReader::Zip.open_buffer(string_or_io) end def read entry_at('xl/workbook.xml') do |file_io| loader.sheet_toc, loader.base_date = *WorkbookParser.parse(file_io) end entry_at('xl/styles.xml') do |file_io| loader.style_types = StyleTypesParser.parse(file_io) end # optional feature used by excel, # but not often used by xlsx generation libraries if (ss_entry = entry_at('xl/sharedStrings.xml')) ss_entry.get_input_stream do |file| loader.shared_strings = SharedStringsParser.parse(file) end else loader.shared_strings = [] end loader.sheet_parsers = [] # Sometimes there's a zero-index sheet.xml, ex. 
# Google Docs creates: # xl/worksheets/sheet.xml # xl/worksheets/sheet1.xml # xl/worksheets/sheet2.xml # While Excel creates: # xl/worksheets/sheet1.xml # xl/worksheets/sheet2.xml add_sheet_parser_at_index(nil) i = 1 while(add_sheet_parser_at_index(i)) do i += 1 end end def entry_at(path, &block) # Older and newer (post-mid-2021) RubyZip normalizes pathnames, # but unfortunately there is a time in between where it doesn't. # Rather than require a specific version, let's just be flexible. entry = zip.find_entry(path) || # *nix-generated zip.find_entry(path.tr('/', '\\')) || # Windows-generated zip.find_entry(path.downcase) || # Sometimes it's lowercase zip.find_entry(path.tr('/', '\\').downcase) # Sometimes it's lowercase if block entry.get_input_stream(&block) else entry end end def add_sheet_parser_at_index(i) sheet_file_name = "xl/worksheets/sheet#{i}.xml" return unless (entry = entry_at(sheet_file_name)) parser = SheetParser.new( file_io: entry.get_input_stream, loader: loader ) relationship_file_name = "xl/worksheets/_rels/sheet#{i}.xml.rels" if (rel = entry_at(relationship_file_name)) parser.xrels_file = rel.get_input_stream end loader.sheet_parsers << parser end end ## # The heart of typecasting. The ruby type is determined either explicitly # from the cell xml or implicitly from the cell style, and this # method expects that work to have been done already. This, then, # takes the type we determined it to be and casts the cell value # to that type. # # types: # - s: shared string (see #shared_string) # - n: number (cast to a float) # - b: boolean # - str: string # - inlineStr: string # - ruby symbol: for when type has been determined by style # # options: # - shared_strings: needed for 's' (shared string) type def self.cast(value, type, style, options = {}) return nil if value.nil? || value.empty? # Sometimes the type is dictated by the style alone if type.nil? 
|| (type == 'n' && %i[date time date_time].include?(style)) type = style end casted = case type ## # There are a few built-in types ## when 's' # shared string options[:shared_strings][value.to_i] when 'n' # number value.to_f when 'b' value.to_i == 1 when 'str' value when 'inlineStr' value ## # Type can also be determined by a style, # detected earlier and cast here by its standardized symbol ## # no type encoded with the General format defaults to a number type when nil, :string retval = Integer(value, exception: false) retval ||= Float(value, exception: false) retval ||= value retval when :unsupported value when :fixnum value.to_i when :float value.to_f when :percentage value.to_f # the trickiest. note that all these formats can vary on # whether they actually contain a date, time, or datetime. when :date, :time, :date_time value = Float(value) days_since_date_system_start = value.to_i fraction_of_24 = value - days_since_date_system_start # http://stackoverflow.com/questions/10559767/how-to-convert-ms-excel-date-from-float-to-date-format-in-ruby date = options.fetch(:base_date, DATE_SYSTEM_1900) + days_since_date_system_start if fraction_of_24 > 0 # there is a time associated seconds = (fraction_of_24 * 86_400).round return Time.utc(date.year, date.month, date.day) + seconds else return date end when :bignum if defined?(BigDecimal) BigDecimal(value) else value.to_f end ## # Beats me ## else value end if options[:url] Hyperlink.new(options[:url], casted) else casted end end end end ================================================ FILE: lib/simple_xlsx_reader/version.rb ================================================ # frozen_string_literal: true module SimpleXlsxReader VERSION = '5.1.0' end ================================================ FILE: lib/simple_xlsx_reader.rb ================================================ # frozen_string_literal: true require 'nokogiri' require 'date' require 'simple_xlsx_reader/version' require 'simple_xlsx_reader/hyperlink'
require 'simple_xlsx_reader/document' require 'simple_xlsx_reader/loader' require 'simple_xlsx_reader/loader/workbook_parser' require 'simple_xlsx_reader/loader/shared_strings_parser' require 'simple_xlsx_reader/loader/sheet_parser' require 'simple_xlsx_reader/loader/style_types_parser' # Rubyzip 1.0 only has different naming, everything else is the same, so let's # be flexible so we don't force people into a dependency hell w/ other gems. begin # Try loading rubyzip < 1.0 require 'zip/zip' require 'zip/zipfilesystem' SimpleXlsxReader::Zip = Zip::ZipFile rescue LoadError # Try loading rubyzip >= 1.0 require 'zip' require 'zip/filesystem' SimpleXlsxReader::Zip = Zip::File end module SimpleXlsxReader DATE_SYSTEM_1900 = Date.new(1899, 12, 30) DATE_SYSTEM_1904 = Date.new(1904, 1, 1) class CellLoadError < StandardError; end class << self def configuration @configuration ||= Struct.new(:catch_cell_load_errors, :auto_slurp).new.tap do |c| c.catch_cell_load_errors = false c.auto_slurp = false end end def open(file_path) Document.new(file_path: file_path).tap(&:sheets) end def parse(string_or_io) Document.new(string_or_io: string_or_io).tap(&:sheets) end end end ================================================ FILE: simple_xlsx_reader.gemspec ================================================ # -*- encoding: utf-8 -*- lib = File.expand_path('../lib', __FILE__) $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib) require 'simple_xlsx_reader/version' Gem::Specification.new do |gem| gem.name = "simple_xlsx_reader" gem.version = SimpleXlsxReader::VERSION gem.authors = ["Woody Peterson"] gem.email = ["woody.peterson@gmail.com"] gem.description = %q{Read xlsx data the Ruby way} gem.summary = %q{Read xlsx data the Ruby way} gem.homepage = "" gem.license = "MIT" gem.add_dependency 'nokogiri' gem.add_dependency 'rubyzip' gem.add_development_dependency 'minitest', '>= 5.0' gem.add_development_dependency 'rake' gem.add_development_dependency 'pry' gem.files = `git ls-files`.split($/) 
gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) } gem.test_files = gem.files.grep(%r{^test/}) gem.require_paths = ["lib"] end ================================================ FILE: test/date1904_test.rb ================================================ # frozen_string_literal: true require_relative 'test_helper' describe SimpleXlsxReader do let(:date1904_file) { File.join(File.dirname(__FILE__), 'date1904.xlsx') } let(:subject) { SimpleXlsxReader::Document.new(date1904_file) } it 'supports converting dates with the 1904 date system' do _(subject.to_hash).must_equal( 'date1904' => [[Date.parse('2014-05-01')]] ) end end ================================================ FILE: test/datetime_test.rb ================================================ # frozen_string_literal: true require_relative 'test_helper' describe SimpleXlsxReader do let(:datetimes_file) do File.join( File.dirname(__FILE__), 'datetimes.xlsx' ) end let(:subject) { SimpleXlsxReader::Document.new(datetimes_file) } it 'converts date_times with the correct precision' do _(subject.to_hash).must_equal( 'Datetimes' => [ [Time.parse('2013-08-19 18:29:59 UTC')], [Time.parse('2013-08-19 18:30:00 UTC')], [Time.parse('2013-08-19 18:30:01 UTC')], [Time.parse('1899-12-30 00:30:00 UTC')] ] ) end end ================================================ FILE: test/gdocs_sheet_test.rb ================================================ # frozen_string_literal: true require_relative 'test_helper' require 'time' describe SimpleXlsxReader do let(:one_sheet_file) { File.join(File.dirname(__FILE__), 'gdocs_sheet.xlsx') } let(:subject) { SimpleXlsxReader::Document.new(one_sheet_file) } it 'able to load file from google docs' do _(subject.to_hash).must_equal( 'List 1' => [['Empty gdocs list 1']], 'List 2' => [['Empty gdocs list 2']] ) end end ================================================ FILE: test/lower_case_sharedstrings_test.rb ================================================ # frozen_string_literal: true 
require_relative 'test_helper' describe SimpleXlsxReader do let(:lower_case_shared_strings) do File.join( File.dirname(__FILE__), 'lower_case_sharedstrings.xlsx' ) end let(:subject) { SimpleXlsxReader::Document.new(lower_case_shared_strings) } describe '#to_hash' do it 'should have the word Well in the first row' do _(subject.sheets.first.rows.to_a[0]).must_include('Well') end end end ================================================ FILE: test/namespaces_and_missing_atts_test.rb ================================================ # frozen_string_literal: true require_relative 'test_helper' describe SimpleXlsxReader do # Based on a real-world sheet possibly generated by PowerBI, where the xml # has namespacing and rows are missing the 'r' attribute. let(:sheet) do <<~XML Salmon Trout Cat Dog XML end let(:styles) do <<~XML XML end let(:wonky_file) do TestXlsxBuilder.new( sheets: [sheet], styles: styles ) end let(:subject) { SimpleXlsxReader::Document.new(wonky_file.archive.path) } describe '#to_hash' do it 'should extract values from namespaced cells missing "r" attributes' do _(subject.sheets.first.rows.to_a[0]).must_include('Salmon') _(subject.sheets.first.rows.to_a[1]).must_include('Dog') end end end ================================================ FILE: test/performance_test.rb ================================================ # frozen_string_literal: true require_relative 'test_helper' require 'minitest/benchmark' describe 'SimpleXlsxReader Benchmark' do # n is 0-indexed for us, then converted to 1-indexed for excel def sheet_with_n_rows(row_count) acc = +"" acc << <<~XML XML row_count.times.each do |n| n += 1 acc << <<~XML Cell A#{n} 2.4 30687 Cell D#{n} Cell E#{n} 2.4 30687 Cell H#{n} Cell I#{n} 2.4 30687 Cell L#{n} XML end acc << <<~XML XML end let(:styles) do # s='0' above refers to the value of numFmtId at cellXfs index 0, # which is in this case 'General' type _styles = <<-XML XML end before do @xlsxs = {} # Every new sheet has one more row 
self.class.bench_range.each do |num_rows| @xlsxs[num_rows] = TestXlsxBuilder.new( sheets: [sheet_with_n_rows(num_rows)], styles: styles ).archive end end def self.bench_range # Works out to a max just shy of 265k rows, which takes ~20s on my M1 Mac. # Second-largest is ~65k rows @ ~5s. max = ENV['BIG_PERF_TEST'] ? 265_000 : 66_000 bench_exp(100, max, 4) end bench_performance_linear 'parses sheets in linear time', 0.999 do |n| SimpleXlsxReader.open(@xlsxs[n].path).sheets[0].rows.each(headers: true) {|_row| } end end ================================================ FILE: test/shared_strings.xml ================================================ Cell A1 Cell B1 My Cell Cell A2 Cell B2 Cell Fmt ’ When it sees a unicode character (such as the fancy apostrophe starting this sentence), it starts chunking the stream for at least the current node, and we have to keep consuming the characters until we hit the end of the text. We can't assume that the string first given by the SAX callback us is the whole shared string content. It only happens with both unicode *and* really long text. ================================================ FILE: test/simple_xlsx_reader_test.rb ================================================ # frozen_string_literal: true require_relative 'test_helper' require 'time' SXR = SimpleXlsxReader describe SimpleXlsxReader do let(:sesame_street_blog_file) do File.join(File.dirname(__FILE__), 'sesame_street_blog.xlsx') end let(:document) { SimpleXlsxReader.open(sesame_street_blog_file) } ## # A high-level acceptance test testing misc features such as date/time parsing, # hyperlinks (both function and ref kinds), formula dates, empty rows, etc.
let(:sesame_street_blog_file_path) { File.join(File.dirname(__FILE__), 'sesame_street_blog.xlsx') } let(:sesame_street_blog_io) { File.new(sesame_street_blog_file_path) } let(:sesame_street_blog_string) { IO.read(sesame_street_blog_file_path) } let(:expected_result) do { 'Authors' => [ ['Name', 'Occupation'], ['Big Bird', 'Teacher'] ], 'Posts' => [ ['Author Name', 'Title', 'Body', 'Created At', 'Comment Count', 'URL'], ['Big Bird', 'The Number 1', 'The Greatest', Time.parse('2002-01-01 11:00:00 UTC'), 1, SXR::Hyperlink.new('http://www.example.com/hyperlink-function', 'This uses the HYPERLINK() function')], ['Big Bird', 'The Number 2', 'Second Best', Time.parse('2002-01-02 14:00:00 UTC'), 2, SXR::Hyperlink.new('http://www.example.com/hyperlink-gui', 'This uses the hyperlink GUI option')], ['Big Bird', 'Formula Dates', 'Tricky tricky', Time.parse('2002-01-03 14:00:00 UTC'), 0, nil], ['Empty Eagress', nil, 'The title, date, and comment have types, but no values', nil, nil, nil] ] } end describe SimpleXlsxReader do describe 'load from file path' do let(:subject) { SimpleXlsxReader.open(sesame_street_blog_file_path) } it 'reads an xlsx file into a hash of {[sheet name] => [data]}' do _(subject.to_hash).must_equal(expected_result) end end describe 'load from buffer' do let(:subject) { SimpleXlsxReader.parse(sesame_street_blog_io) } it 'reads an xlsx buffer into a hash of {[sheet name] => [data]}' do _(subject.to_hash).must_equal(expected_result) end end describe 'load from string' do let(:subject) { SimpleXlsxReader.parse(sesame_street_blog_string) } it 'reads an xlsx string into a hash of {[sheet name] => [data]}' do _(subject.to_hash).must_equal(expected_result) end end it 'outputs strings in UTF-8 encoding' do document = SimpleXlsxReader.parse(sesame_street_blog_io) _(document.sheets[0].rows.to_a.flatten.map(&:encoding).uniq) .must_equal [Encoding::UTF_8] end it 'can use all our enumerable niceties without slurping' do document =
SimpleXlsxReader.parse(sesame_street_blog_io) headers = { name: 'Author Name', title: 'Title', body: 'Body', created_at: 'Created At', count: /Count/ } rows = document.sheets[1].rows result = rows.each(headers: headers).with_index.with_object({}) do |(row, i), acc| acc[i] = row end _(result[0]).must_equal( name: 'Big Bird', title: 'The Number 1', body: 'The Greatest', created_at: Time.parse('2002-01-01 11:00:00 UTC'), count: 1, "URL" => 'This uses the HYPERLINK() function' ) _(rows.slurped?).must_equal false end end ## # For more fine-grained unit tests, we sometimes build our own workbook via # Nokogiri. TestXlsxBuilder has some defaults, and this let-style lets us # concisely override them in nested describe blocks. let(:shared_strings) { nil } let(:styles) { nil } let(:sheet) { nil } let(:workbook) { nil } let(:rels) { nil } let(:xlsx) do TestXlsxBuilder.new( shared_strings: shared_strings, styles: styles, sheets: sheet && [sheet], workbook: workbook, rels: rels ) end let(:reader) { SimpleXlsxReader.open(xlsx.archive.path) } describe 'when parsing escaped characters' do let(:escaped_content) do '<a href="https://www.example.com">Link A</a> &bull; <a href="https://www.example.com">Link B</a>' end let(:unescaped_content) do 'Link ALink B' end let(:sheet) do <<~XML 0 #{escaped_content} XML end let(:shared_strings) do <<~XML #{escaped_content} XML end it 'loads correctly using inline strings' do _(reader.sheets[0].rows.slurp[0][0]).must_equal(unescaped_content) end it 'loads correctly using shared strings' do _(reader.sheets[0].rows.slurp[0][1]).must_equal(unescaped_content) end end describe 'Sheet#rows#each(headers: true)' do let(:sheet) do <<~XML Header 1 Header 2 Data 1-A Data 1-B Data 2-A Data 2-B XML end it 'yields rows as hashes' do acc = [] reader.sheets[0].rows.each(headers: true) do |row| acc << row end _(acc).must_equal( [ { 'Header 1' => 'Data 1-A', 'Header 2' => 'Data 1-B' }, { 'Header 1' => nil, 'Header 2' => nil }, { 'Header 1' => 'Data 2-A', 'Header 
2' => 'Data 2-B' } ] ) end end describe 'Sheet#rows#each(headers: ->(row) {...})' do let(:sheet) do <<~XML a chart or something Rabble rabble Chatty junk Header 1 Header 2 Data 1-A Data 1-B Data 2-A Data 2-B XML end it 'yields rows as hashes' do acc = [] finder = ->(row) { row.find {|c| c&.match(/Header/)} } reader.sheets[0].rows.each(headers: finder) do |row| acc << row end _(acc).must_equal( [ { 'Header 1' => 'Data 1-A', 'Header 2' => 'Data 1-B' }, { 'Header 1' => nil, 'Header 2' => nil }, { 'Header 1' => 'Data 2-A', 'Header 2' => 'Data 2-B' } ] ) end end describe "Sheet#rows#each(headers: a_hash)" do let(:sheet) do Nokogiri::XML( <<~XML a chart or something Rabble rabble Rabble rabble Chatty junk ID Number ExacT FOO Name ID 1-A Exact 1-B Name 1-C ID 2-A Exact 2-B Name 2-C XML ) end it 'transforms headers into symbols based on the header map' do header_map = {id: /ID/, name: /foo/i, exact: 'ExacT'} result = reader.sheets[0].rows.each(headers: header_map).to_a _(result).must_equal( [ { id: 'ID 1-A', exact: 'Exact 1-B', name: 'Name 1-C' }, { id: nil, exact: nil, name: nil }, { id: 'ID 2-A', exact: 'Exact 2-B', name: 'Name 2-C' }, ] ) end it 'if a match isnt found, uses un-matched header name' do sheet.xpath("//*[text() = 'ExacT']") .first.children.first.content = 'not ExacT' header_map = {id: /ID/, name: /foo/i, exact: 'ExacT'} result = reader.sheets[0].rows.each(headers: header_map).to_a _(result).must_equal( [ { id: 'ID 1-A', 'not ExacT' => 'Exact 1-B', name: 'Name 1-C' }, { id: nil, 'not ExacT' => nil, name: nil }, { id: 'ID 2-A', 'not ExacT' => 'Exact 2-B', name: 'Name 2-C' }, ] ) end end describe 'Sheet#rows[]' do it 'raises a RuntimeError if rows not slurped yet' do _(-> { reader.sheets[0].rows[1] }).must_raise(RuntimeError) end it 'works if the rows have been slurped' do _(reader.sheets[0].rows.tap(&:slurp)[0]).must_equal( ['Cell A', 'Cell B', 'Cell C'] ) end it 'works if the config allows auto slurping' do SimpleXlsxReader.configuration.auto_slurp = true 
_(reader.sheets[0].rows[0]).must_equal( ['Cell A', 'Cell B', 'Cell C'] ) SimpleXlsxReader.configuration.auto_slurp = false end end describe 'Sheet#rows#slurp' do let(:rows) { reader.sheets[0].rows.tap(&:slurp) } it 'loads the sheet parser results into memory' do _(rows.slurped).must_equal( [['Cell A', 'Cell B', 'Cell C']] ) end it '#each and #map use slurped results' do _(rows.map(&:reverse)).must_equal( [['Cell C', 'Cell B', 'Cell A']] ) end end describe 'Sheet#rows#each' do let(:sheet) do <<~XML Header 1 Header 2 Data 1-A Data 1-B Data 2-A Data 2-B XML end let(:rows) { reader.sheets[0].rows } it 'with no block, returns an enumerator when not slurped' do _(rows.each.class).must_equal Enumerator end it 'with no block, passes on header argument in enumerator' do _(rows.each(headers: true).inspect).must_match 'headers: true' end it 'returns an enumerator when slurped' do rows.slurp _(rows.each.class).must_equal Enumerator end end describe 'Sheet#rows#map' do let(:sheet) do <<~XML Header 1 Header 2 Data 1-A Data 1-B Data 2-A Data 2-B XML end let(:rows) { reader.sheets[0].rows } it 'does not slurp' do _(rows.map(&:first)).must_equal( ["Header 1", "Data 1-A", nil, "Data 2-A"] ) _(rows.slurped?).must_equal false end end describe 'Sheet#headers' do let(:doc_sheet) { reader.sheets[0] } it 'raises a RuntimeError if rows not slurped yet' do _(-> { doc_sheet.headers }).must_raise(RuntimeError) end it 'returns first row if slurped' do _(doc_sheet.tap(&:slurp).headers).must_equal( ['Cell A', 'Cell B', 'Cell C'] ) end it 'returns first row if auto_slurp' do SimpleXlsxReader.configuration.auto_slurp = true _(doc_sheet.headers).must_equal( ['Cell A', 'Cell B', 'Cell C'] ) SimpleXlsxReader.configuration.auto_slurp = false end end describe SimpleXlsxReader::Loader do let(:described_class) { SimpleXlsxReader::Loader } describe '::cast' do it 'reads type s as a shared string' do _(described_class.cast('1', 's', nil, shared_strings: %w[a b c])) .must_equal 'b' end it 'reads type 
inlineStr as a string' do _(described_class.cast('the value', nil, 'inlineStr')) .must_equal 'the value' end it 'reads date styles' do _(described_class.cast('41505', nil, :date)) .must_equal Date.parse('2013-08-19') end it 'reads time styles' do _(described_class.cast('41505.77083', nil, :time)) .must_equal Time.parse('2013-08-19 18:30 UTC') end it 'reads date_time styles' do _(described_class.cast('41505.77083', nil, :date_time)) .must_equal Time.parse('2013-08-19 18:30 UTC') end it 'reads number types styled as dates' do _(described_class.cast('41505', 'n', :date)) .must_equal Date.parse('2013-08-19') end it 'reads number types styled as times' do _(described_class.cast('41505.77083', 'n', :time)) .must_equal Time.parse('2013-08-19 18:30 UTC') end it 'reads less-than-zero complex number types styled as times' do _(described_class.cast('6.25E-2', 'n', :time)) .must_equal Time.parse('1899-12-30 01:30:00 UTC') end it 'reads number types styled as date_times' do _(described_class.cast('41505.77083', 'n', :date_time)) .must_equal Time.parse('2013-08-19 18:30 UTC') end it 'raises when date-styled values are not numerical' do _(-> { described_class.cast('14 is not a valid date', nil, :date) }) .must_raise(ArgumentError) end describe 'with the url option' do let(:url) { 'http://www.example.com/hyperlink' } it 'creates a hyperlink with a string type' do _(described_class.cast('A link', 'str', :string, url: url)) .must_equal SXR::Hyperlink.new(url, 'A link') end it 'creates a hyperlink with a shared string type' do _(described_class.cast('2', 's', nil, shared_strings: %w[a b c], url: url)) .must_equal SXR::Hyperlink.new(url, 'c') end it 'creates a hyperlink with a fixnum friendly_name' do _(described_class.cast('123', nil, :fixnum, url: url)) .must_equal SXR::Hyperlink.new(url, '123') end end end describe 'shared_strings' do let(:xml) do File.open(File.join(File.dirname(__FILE__), 'shared_strings.xml')) end let(:ss) { 
SimpleXlsxReader::Loader::SharedStringsParser.parse(xml) } it 'parses strings formatted at the cell level' do _(ss[0..2]).must_equal ['Cell A1', 'Cell B1', 'My Cell'] end it 'parses strings formatted at the character level' do _(ss[3..5]).must_equal ['Cell A2', 'Cell B2', 'Cell Fmt'] end it 'parses looong strings containing unicode' do _(ss[6]).must_include 'It only happens with both unicode *and* really long text.' end end describe 'style_types' do let(:xml_file) do File.open(File.join(File.dirname(__FILE__), 'styles.xml')) end let(:parser) do SimpleXlsxReader::Loader::StyleTypesParser.new(xml_file).tap(&:parse) end it 'reads custom formatted styles (numFmtId >= 164)' do _(parser.style_types[1]).must_equal :date_time _(parser.custom_style_types[164]).must_equal :date_time end # something I've seen in the wild; don't think it's correct, but let's be flexible. it 'reads custom formatted styles given an id < 164, but not explicitly defined in the SpreadsheetML spec' do _(parser.style_types[2]).must_equal :date_time _(parser.custom_style_types[59]).must_equal :date_time end end describe '#last_cell_label' do # Note, this is not a valid sheet, since the last cell is actually D1 but # the dimension specifies C1. This is just for testing. let(:sheet) do Nokogiri::XML( <<-XML Cell A Cell C Cell D XML ).remove_namespaces! 
end let(:loader) do SimpleXlsxReader::Loader.new(nil).tap do |l| l.shared_strings = [] l.sheet_toc = { 'Sheet1': 0 } l.style_types = [] l.base_date = SimpleXlsxReader::DATE_SYSTEM_1900 end end let(:sheet_parser) do tempfile = Tempfile.new(['sheet', '.xml']) tempfile.write(sheet) tempfile.rewind SimpleXlsxReader::Loader::SheetParser.new( file_io: tempfile, loader: loader ).tap { |parser| parser.parse {} } end it 'uses /worksheet/dimension if available' do _(sheet_parser.last_cell_letter).must_equal 'C' end it 'uses the last header cell if /worksheet/dimension is missing' do sheet.at_xpath('/worksheet/dimension').remove _(sheet_parser.last_cell_letter).must_equal 'D' end it 'returns "A" if the dimension is just one cell' do sheet.xpath('/worksheet/sheetData/row').remove sheet.xpath('/worksheet/dimension').attr('ref', 'A1') _(sheet_parser.last_cell_letter).must_equal 'A' end it 'returns nil if the sheet is just one cell, but /worksheet/dimension is missing' do sheet.xpath('/worksheet/sheetData/row').remove sheet.xpath('/worksheet/dimension').remove _(sheet_parser.last_cell_letter).must_be_nil end end describe '#column_letter_to_number' do let(:subject) { SXR::Loader::SheetParser.new(file_io: nil, loader: nil) } [ ['A', 1], ['B', 2], ['Z', 26], ['AA', 27], ['AB', 28], ['AZ', 52], ['BA', 53], ['BZ', 78], ['ZZ', 702], ['AAA', 703], ['AAZ', 728], ['ABA', 729], ['ABZ', 754], ['AZZ', 1378], ['ZZZ', 18_278] ].each do |(letter, number)| it "converts #{letter} to #{number}" do _(subject.column_letter_to_number(letter)).must_equal number end end end end describe 'parse errors' do after do SimpleXlsxReader.configuration.catch_cell_load_errors = false end let(:sheet) do Nokogiri::XML( <<-XML 14 is a date style; this is not a date XML ).remove_namespaces! end let(:styles) do # s='0' above refers to the value of numFmtId at cellXfs index 0 Nokogiri::XML( <<-XML XML ).remove_namespaces!
end it 'raises if not configuration.catch_cell_load_errors' do SimpleXlsxReader.configuration.catch_cell_load_errors = false _(-> { SimpleXlsxReader.open(xlsx.archive.path).to_hash }) .must_raise(SimpleXlsxReader::CellLoadError) end it 'records a load error if configuration.catch_cell_load_errors' do SimpleXlsxReader.configuration.catch_cell_load_errors = true sheet = SimpleXlsxReader.open(xlsx.archive.path).sheets[0].tap(&:slurp) _(sheet.load_errors).must_equal( [0, 0] => 'invalid value for Float(): "14 is a date style; this is not a date"' ) end end describe 'missing numFmtId attributes' do let(:sheet) do Nokogiri::XML( <<-XML some content XML ).remove_namespaces! end let(:styles) do Nokogiri::XML( <<-XML XML ).remove_namespaces! end before do @row = SimpleXlsxReader.open(xlsx.archive.path).sheets[0].rows.to_a[0] end it 'continues even when cells are missing numFmtId attributes' do _(@row[0]).must_equal 'some content' end end describe 'parsing types' do let(:sheet) do Nokogiri::XML( <<-XML Cell A1 2.4 30687 Cell G1 HYPERLINK("http://www.example.com/hyperlink-function", "HYPERLINK function") HYPERLINK function GUI-made hyperlink 1 XML ).remove_namespaces! end let(:styles) do # s='0' above refers to the value of numFmtId at cellXfs index 0, # which is in this case 'General' type Nokogiri::XML( <<-XML XML ).remove_namespaces! end # Although not a "type" or "style" according to xlsx spec, # it sure could/should be, so let's test it with the rest of our # typecasting code. let(:rels) do [ Nokogiri::XML( <<-XML XML ).remove_namespaces! ] end before do @row = SimpleXlsxReader.open(xlsx.archive.path).sheets[0].rows.to_a[0] end it "reads 'Generic' cells as strings" do _(@row[0]).must_equal 'Cell A1' end it "reads empty 'Generic' cells as nil" do _(@row[1]).must_be_nil end # We could expand on these type tests, but really just a couple # demonstrate that it's wired together.
    # Type-specific tests should go on #cast.
    it 'reads floats' do
      _(@row[2]).must_equal 2.4
    end

    it 'reads empty floats as nil' do
      _(@row[3]).must_be_nil
    end

    it 'reads dates' do
      _(@row[4]).must_equal Date.parse('Jan 6, 1984')
    end

    it 'reads empty date cells as nil' do
      _(@row[5]).must_be_nil
    end

    it 'reads strings formatted as inlineStr' do
      _(@row[6]).must_equal 'Cell G1'
    end

    it 'reads hyperlinks created via HYPERLINK()' do
      _(@row[7]).must_equal(
        SXR::Hyperlink.new(
          'http://www.example.com/hyperlink-function', 'HYPERLINK function'
        )
      )
    end

    it 'reads hyperlinks created via the GUI' do
      _(@row[8]).must_equal(
        SXR::Hyperlink.new(
          'http://www.example.com/hyperlink-gui', 'GUI-made hyperlink'
        )
      )
    end

    it "reads 'General' cells with numbers as numbers" do
      _(@row[9]).must_equal 1
    end
  end

  describe 'parsing documents with blank rows' do
    let(:sheet) do
      Nokogiri::XML(
        <<-XML
          a 1 2 3
        XML
      ).remove_namespaces!
    end

    before do
      @rows = SimpleXlsxReader.open(xlsx.archive.path).sheets[0].rows.to_a
    end

    it 'reads row data despite gaps in row numbering' do
      _(@rows).must_equal [
        [nil, nil, nil, nil],
        ['a', nil, nil, nil],
        [nil, nil, nil, nil],
        [nil, 1, nil, nil],
        [nil, nil, 2, nil],
        [nil, nil, nil, nil],
        [nil, nil, nil, 3]
      ]
    end
  end

  describe 'parsing documents with non-hyperlinked rels' do
    let(:rels) do
      [
        Nokogiri::XML(
          <<-XML
          XML
        ).remove_namespaces!
      ]
    end

    describe 'when document is opened as path' do
      before do
        @row = SimpleXlsxReader.open(xlsx.archive.path).sheets[0].rows.to_a[0]
      end

      it 'reads cell content' do
        _(@row[0]).must_equal 'Cell A'
      end
    end

    describe 'when document is parsed as a String' do
      before do
        output = File.binread(xlsx.archive.path)
        @row = SimpleXlsxReader.parse(output).sheets[0].rows.to_a[0]
      end

      it 'reads cell content' do
        _(@row[0]).must_equal 'Cell A'
      end
    end

    describe 'when document is parsed as StringIO' do
      before do
        stream = StringIO.new(File.binread(xlsx.archive.path), 'rb')
        @row = SimpleXlsxReader.parse(stream).sheets[0].rows.to_a[0]
        stream.close
      end

      it 'reads cell content' do
        _(@row[0]).must_equal 'Cell A'
      end
    end
  end

  # https://support.microsoft.com/en-us/office/available-number-formats-in-excel-0afe8f52-97db-41f1-b972-4b46e9f1e8d2
  describe 'numeric fields styled as "General"' do
    let(:misc_numbers_path) do
      File.join(File.dirname(__FILE__), 'misc_numbers.xlsx')
    end

    let(:sheet) { SimpleXlsxReader.open(misc_numbers_path).sheets[0] }

    it 'reads medium sized integers as integers' do
      _(sheet.rows.slurp[1][0]).must_equal 98070
    end

    it 'reads large (>12 char) integers as integers' do
      _(sheet.rows.slurp[1][1]).must_equal 1234567890123
    end
  end

  describe 'with mysteriously chunky UTF-8 text' do
    let(:chunky_utf8_path) do
      File.join(File.dirname(__FILE__), 'chunky_utf8.xlsx')
    end

    let(:sheet) { SimpleXlsxReader.open(chunky_utf8_path).sheets[0] }

    it 'reads the whole cell text' do
      _(sheet.rows.slurp[1]).must_equal(
        ["sample-company-1", "Korntal-Münchingen", "Bronholmer straße"]
      )
    end
  end

  describe 'when using percentages & currencies' do
    let(:pnc_path) do
      # This file provided by a GitHub user having parse errors in these fields
      File.join(File.dirname(__FILE__), 'percentages_n_currencies.xlsx')
    end

    let(:sheet) { SimpleXlsxReader.open(pnc_path).sheets[0] }

    it 'reads percentages as floats of the form 0.XX' do
      _(sheet.rows.slurp[1][2]).must_equal(0.87)
    end

    it 'reads currencies as floats' do
      _(sheet.rows.slurp[1][4]).must_equal(300.0)
    end
  end
end

================================================
FILE: test/styles.xml
================================================

================================================
FILE: test/test_helper.rb
================================================
# frozen_string_literal: true

gem 'minitest'
require 'minitest/autorun'
require 'minitest/spec'
require 'pry'
require 'time'
require 'test_xlsx_builder'

$LOAD_PATH.unshift File.expand_path('lib')
require 'simple_xlsx_reader'

================================================
FILE: test/test_xlsx_builder.rb
================================================
# frozen_string_literal: true

require 'nokogiri'
require 'tempfile' # Tempfile is used by #archive
require 'zip'      # rubyzip, provides Zip::File used by #archive

TestXlsxBuilder = Struct.new(:shared_strings, :styles, :sheets, :workbook,
                             :rels, keyword_init: true) do
  DEFAULTS = {
    workbook: Nokogiri::XML(
      <<-XML
      XML
    ).remove_namespaces!,

    styles: Nokogiri::XML(
      <<-XML
      XML
    ).remove_namespaces!,

    sheet: Nokogiri::XML(
      <<-XML
        Cell A Cell B Cell C
      XML
    ).remove_namespaces!
  }

  # Fall back to the default workbook, styles, and single sheet
  # for any part the caller did not supply.
  def initialize(*args)
    super
    self.workbook ||= DEFAULTS[:workbook]
    self.styles ||= DEFAULTS[:styles]
    self.sheets ||= [DEFAULTS[:sheet]]
    self.rels ||= []
  end

  # Writes the configured parts into a temporary .xlsx (zip) archive
  # and returns the Tempfile.
  def archive
    tmpfile = Tempfile.new(['workbook', '.xlsx'])
    tmpfile.binmode
    tmpfile.rewind

    Zip::File.open(tmpfile.path, create: true) do |zip|
      zip.mkdir('xl')

      zip.get_output_stream('xl/workbook.xml') do |wb_file|
        wb_file.write(workbook)
      end

      zip.get_output_stream('xl/styles.xml') do |styles_file|
        styles_file.write(styles)
      end

      if shared_strings
        zip.get_output_stream('xl/sharedStrings.xml') do |ss_file|
          ss_file.write(shared_strings)
        end
      end

      zip.mkdir('xl/worksheets')

      sheets.each_with_index do |sheet, i|
        zip.get_output_stream("xl/worksheets/sheet#{i + 1}.xml") do |sf|
          sf.write(sheet)
        end

        if rels[i]
          zip.mkdir('xl/worksheets/_rels')
          zip.get_output_stream("xl/worksheets/_rels/sheet#{i + 1}.xml.rels") do |rf|
            rf.write(rels[i])
          end
        end
      end
    end

    tmpfile
  end
end