Update. Everything below is now inessential, since I've found the Stack Overflow answer about HDF5.
The only thing I don't agree with there is Blaze: I've tried it, and it is clearly still raw and will need a lot of time to become not just stable but genuinely useful.
My current workflow is built entirely on IPython, and I work a lot with pandas (which I personally consider a good example of poor library design).
Nevertheless, I recently moved to HDF, though installing PyTables (which pandas needs to work with HDF) wasn't as straightforward as I expected.
Now I convert all my data to HDF.
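A minimal sketch of that conversion, assuming pandas with PyTables installed (the file names and the "df" key are my own placeholders, and the compression settings are optional):

```python
import pandas as pd

# Reading CSV forces pandas to parse text and guess every column's dtype.
df = pd.read_csv("data.csv")

# Writing to HDF5 requires PyTables (pip install tables).
# Compression is optional; blosc usually shrinks the file considerably.
df.to_hdf("data.h5", key="df", mode="w", complib="blosc", complevel=9)

# Reading back needs no parsing: dtypes come straight from the file.
df = pd.read_hdf("data.h5", key="df")
```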
- First, this usually takes about 2-3 times less space to store the data (though that depends on the dataset; for some of them there is no difference between CSV and HDF).
- Second, your data is now stored in a binary format, so all the types are strictly defined: no parsing is needed and no guessing of types.
- Thus, read/write operations are orders of magnitude faster.
- And floating-point numbers are stored exactly, with no round-off from converting them to and from decimal text.
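To make the last two points concrete, here is a small round-trip check of my own (an illustration, not from the original post): the DataFrame comes back from HDF with identical dtypes and bit-for-bit identical values, whereas a CSV round trip goes through a decimal-text representation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": np.random.rand(1000),                       # float64
    "n": np.random.randint(0, 100, size=1000),       # int64
    "t": pd.date_range("2015-01-01", periods=1000),  # datetime64[ns]
})

df.to_hdf("roundtrip.h5", key="df", mode="w")
back = pd.read_hdf("roundtrip.h5", key="df")

# Dtypes and values survive exactly; no parsing or type guessing happened.
pd.testing.assert_frame_equal(df, back)
```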
I hope these are enough arguments to move to HDF.